Abstract

An undirected graphical model is a joint probability distribution defined on an undirected graph G* , 
where the vertices in the graph index a collection of random variables and the edges encode conditional 
independence relationships amongst random variables. The undirected graphical model selection (UGMS) 
problem is to estimate the graph G* given observations drawn from the undirected graphical model. 

This paper proposes a framework for decomposing the UGMS problem into multiple subproblems over
clusters and subsets of the separators in a junction tree. The junction tree is constructed using a graph
that contains a superset of the edges in G*. We highlight three main properties of using junction
trees for UGMS. First, different regularization parameters or different UGMS algorithms can be used
to learn different parts of the graph. This is possible since the subproblems we identify can be solved
independently of each other. Second, under certain conditions, a junction tree based UGMS algorithm can
produce consistent results with exponentially fewer observations than the usual requirements of existing
algorithms. Third, both our theoretical and experimental results show that the junction tree framework
does a significantly better job at finding the weakest edges in a graph than existing methods. This
property is a consequence of both the first and second properties. Finally, we note that our framework is
independent of the choice of the UGMS algorithm and can be used as a wrapper around standard UGMS
algorithms for more accurate graph estimation.

Keywords: Graphical models; Markov random fields; junction trees; model selection; graphical model
selection; high-dimensional statistics; graph decomposition.

1. Introduction

An undirected graphical model is a joint probability distribution $P_X$ of a random vector X defined on
an undirected graph G*. The graph G* consists of a set of vertices $V = \{1, \ldots, p\}$ and a set of edges
$E(G^*) \subseteq V \times V$. The vertices index the p random variables in X and the edges $E(G^*)$ characterize
conditional independence relationships amongst the random variables in X [25]. We study undirected graphical
models (also known as Markov random fields) so that the graph G* is undirected, i.e., if an edge
$(i, j) \in E(G^*)$, then $(j, i) \in E(G^*)$. The undirected graphical model selection (UGMS) problem is to
estimate G* given n observations $X^n = \{X^{(1)}, \ldots, X^{(n)}\}$ drawn from $P_X$. This problem is of interest
in many areas including biological data analysis, financial analysis, and social network analysis; see [21]
for some more examples.






This paper studies the following problem: Given the observations $X^n$ drawn from $P_X$ and
a graph H that contains all the true edges $E(G^*)$, and possibly some extra edges, estimate the
graph G*.

A natural question to ask is how can the graph H be selected in the first place? One way of doing so is to 
use screening algorithms, such as [13] and [17], to eliminate edges that are clearly non-existent in G*. Another
method can be to use partial prior information about X to remove unnecessary edges. For example, this 
could be based on (i) prior knowledge about statistical properties of genes when analyzing gene expressions, 
(ii) prior knowledge about companies when analyzing stock returns, or (iii) demographic information when 
modeling social networks. Yet another method can be to use clever model selection algorithms that estimate 
more edges than desired. Assuming an initial graph H has been computed, our main contribution in this 
paper is to show how a junction tree representation of H can be used as a wrapper around UGMS algorithms 
for more accurate graph estimation. 
Figure 1: We want to estimate the graph in (a) using the graph (b). Our framework computes the junction
tree in (c) and uses a region graph representation in (d) of the junction tree to decompose the UGMS problem
into multiple subproblems.




Figure 2: Structure of the graph G* we assume to analyze the junction tree framework for UGMS. 

1.1 Overview of the Junction Tree Framework 

A junction tree is a tree-structured representation of an arbitrary graph [39]. The vertices in a junction tree
are clusters of vertices from the original graph. An edge in a junction tree connects two clusters. Junction
trees are used in many applications to reduce the computational complexity of solving various graph related
problems [1]. Figure 1(c) shows an example of a junction tree for the graph in Figure 1(b). Notice that each
edge in the junction tree is labeled by the set of vertices common to both clusters connected by the edge.
This set of vertices is referred to as a separator.

Let H be a graph that contains all the edges in G*. We show that the UGMS problem can be decomposed
into multiple subproblems over clusters and subsets of the separators in a junction tree representation of H.
In particular, using the junction tree, we construct a region graph, which is a directed graph over clusters
of vertices. An example of a region graph for the junction tree in Figure 1(c) is shown in Figure 1(d). The
first two rows in the region graph are the clusters and separators of the junction tree, respectively. The
rest of the rows contain subsets of the separators.¹ The multiple subproblems we identify correspond to
estimating a subset of edges over each cluster in the region graph. For example, the subproblem over the
cluster {1,2,3,5} in Figure 1(d) estimates the edges (1,2), (1,3), and (1,5).

We solve the subproblems over the region graph in an iterative manner. First, all subproblems in the 
first row of the region graph are solved in parallel. Second, the region graph is updated taking into account 
the edges removed in the first step. We keep solving subproblems over rows in the region graph and update 
the region graph until all the edges in the graph H have been estimated. 



1.2 Advantages of Using Junction Trees 

We highlight three main advantages of the junction tree framework for UGMS. 

Choosing Regularization Parameters and UGMS Algorithms: UGMS algorithms typically depend 
on a regularization parameter that controls the number of estimated edges. This regularization parameter 
is usually chosen using model selection algorithms such as cross validation or stability selection. Since each 
subproblem we identify in the region graph is solved independently, different regularization parameters can 
be used to learn different parts of the graph. This has advantages when the true graph G* has different 
characteristics in different parts of the graph. Further, since the subproblems are independent, different 
UGMS algorithms can be used to learn different parts of the graph. Our numerical simulations clearly show 
the advantages of this property. 

¹ See Algorithm 1 for details on how to exactly construct the region graph.

Reduced Sample Complexity: One of the key results of our work is to show that in many cases, the junction
tree framework is capable of consistently estimating a graph under significantly weaker conditions than
required by previously proposed methods. For example, we show that if G* consists of two main components
that are separated by a relatively small number of vertices (see Figure 2 for a general example), then, under
certain conditions, the number of observations needed for consistent estimation scales like $\log(p_{\min})$, where
$p_{\min}$ is the number of vertices in the smaller of the two components. In contrast, existing methods are
consistent if the observations scale like $\log p$, where p is the total number of vertices. If the smaller component
were, for example, exponentially smaller than the larger component, then the junction tree framework is
consistent with about $\log \log p$ observations. For generic problems, without structure that can be exploited
by the junction tree framework, we recover the standard conditions for consistency.
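
To make the scaling concrete, here is a small worked example. The numbers are illustrative assumptions (they are not taken from the paper's experiments), constants and the conditions of the formal result are ignored, and natural logarithms are used:

```latex
p = 10^{6}, \qquad p_{\min} = 14 \approx \log p \\
\log p \approx 13.8, \qquad \log p_{\min} = \log 14 \approx 2.64, \qquad \log\log p \approx 2.63
```

In this example, when the smaller component has roughly $\log p$ vertices, the $\log(p_{\min})$ scaling is about five times smaller than the $\log p$ scaling.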

Learning Weak Edges: A direct consequence of choosing different regularization parameters and the 
reduced sample complexity is that certain weak edges, not estimated using standard algorithms, can be 
estimated when using the junction tree framework. We show this theoretically and using extensive numerical 
simulations on both synthetic and real world data. 

1.3 Related Work 

Several algorithms have been proposed in the literature for learning undirected graphical models. Some 
examples include References [TJ[51[ni[ini[ini[SSlll3] for learning Gaussian graphical models. References |23l[?n 
[29, 30] for learning non-Gaussian graphical models, and References [2, 7, 8, 18, 19, 34, 37] for learning discrete
graphical models. Although all of the above algorithms can be modified to take into account prior knowledge
about a graph H that contains all the true edges (see Appendix B for some examples), our junction tree
framework is fundamentally different from the standard modification of these algorithms. The main difference
is that the junction tree framework allows for using the global Markov property of undirected graphical models
(see Definition 2.1) when learning graphs. This allows for improved graph estimation, as illustrated in both
our theoretical results and numerical results. Finally, we note that all of the above algorithms can be used
in conjunction with the junction tree framework.

Junction trees have been used for performing probabilistic inference in graphical models [24]. This
problem differs from the UGMS problem since the graph is assumed to be known in the inference problem.
The use of junction trees for learning graphical models has been limited to learning the direction of edges in
directed graphical models [53]. These methods cannot be used to learn undirected graphical models. Recent
work [31, 52] has shown that solutions to the graphical lasso (gLasso) [13] problem for UGMS over Gaussian
graphical models can be computed, under certain conditions, by decomposing the problem over connected
components of the graph computed by thresholding the empirical covariance matrix. The methods in [31, 52]
are useful for computing solutions to gLasso for particular choices of the regularization parameter and not
for accurately estimating graphs. Thus, when using gLasso for UGMS, we can use the methods in [31, 52] to
solve gLasso when performing model selection for choosing suitable regularization parameters.

2. Preliminaries 

In this Section, we review some necessary background on graphs and graphical models that we use in this
paper. Section 2.1 reviews some graph theoretic concepts. Section 2.2 reviews undirected graphical models.
Section 2.3 formally defines the undirected graphical model selection (UGMS) problem. Section 2.4 reviews
junction trees, which we use as a tool for decomposing UGMS into multiple subproblems.

2.1 Graph Theoretic Concepts 

A graph is a tuple $G = (V, E(G))$, where V is a set of vertices and $E(G) \subseteq V \times V$ are edges connecting
vertices in V. For any graph H, we use the notation $E(H)$ to denote its edges. We only consider undirected
graphs where if $(v_1, v_2) \in E(G)$, then $(v_2, v_1) \in E(G)$ for $v_1, v_2 \in V$. Some graph theoretic notations that
we use in this paper are summarized as follows:

• Neighbor $ne_G(i)$: Set of nodes connected to i.

• Path $\{i, s_1, \ldots, s_d, j\}$: A sequence of nodes such that $(i, s_1), (s_d, j), (s_k, s_{k+1}) \in E$ for $k = 1, \ldots, d-1$.

• Separator S: A set of nodes such that all paths from i to j contain at least one node in S. The
separator S is minimal if no proper subset of S separates i and j.

• Induced Subgraph $G[A] = (A, E(G[A]))$: A graph over the nodes A such that $E(G[A])$ contains
only the edges involving the nodes in A.

• Complete graph $K_A$: A graph that contains all possible edges over the nodes A.

For two graphs $G_1 = (V_1, E(G_1))$ and $G_2 = (V_2, E(G_2))$, define the following operations:

• Graph Union: $G_1 \cup G_2 = (V_1 \cup V_2, E_1 \cup E_2)$.

• Graph Difference: $G_1 \setminus G_2 = (V_1, E_1 \setminus E_2)$.
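
The notation above maps directly onto set operations. The following Python sketch is only an illustration: the function names and the representation of an edge as an unordered pair are our own choices, and the 3x3 grid with vertices numbered row by row is an assumed stand-in for the grid graph used as an example later in the paper.

```python
from collections import deque

def make_graph(vertices, edges):
    """Return (V, E) with E stored as a set of unordered pairs (frozensets)."""
    return set(vertices), {frozenset(e) for e in edges}

def neighbors(graph, i):
    """ne_G(i): all nodes that share an edge with i."""
    _, E = graph
    return {v for e in E if i in e for v in e if v != i}

def induced_subgraph(graph, A):
    """G[A]: keep only the edges whose endpoints are both in A."""
    _, E = graph
    A = set(A)
    return A, {e for e in E if e <= A}

def complete_graph(A):
    """K_A: all possible edges over the nodes A."""
    A = set(A)
    return A, {frozenset((a, b)) for a in A for b in A if a != b}

def union(g1, g2):
    """G1 u G2 = (V1 u V2, E1 u E2)."""
    return g1[0] | g2[0], g1[1] | g2[1]

def difference(g1, g2):
    """G1 \\ G2 = (V1, E1 \\ E2)."""
    return set(g1[0]), g1[1] - g2[1]

def separates(graph, S, i, j):
    """True if every path from i to j contains a node in S (i, j not in S)."""
    S = set(S)
    seen, queue = {i}, deque([i])
    while queue:
        u = queue.popleft()
        for v in neighbors(graph, u):
            if v == j:
                return False          # reached j without crossing S
            if v not in seen and v not in S:
                seen.add(v)
                queue.append(v)
    return True

# Assumed example: a 3x3 grid with vertices numbered row by row.
grid = make_graph(range(1, 10), [(1, 2), (2, 3), (4, 5), (5, 6), (7, 8), (8, 9),
                                 (1, 4), (4, 7), (2, 5), (5, 8), (3, 6), (6, 9)])
```

On this grid, `separates(grid, {2, 4}, 1, 5)` holds because 2 and 4 are the only neighbors of vertex 1.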



2.2 Undirected Graphical Models 

Definition 2.1 (Undirected Graphical Model [25]). An undirected graphical model is a probability
distribution $P_X$ defined on a graph $G^* = (V, E(G^*))$, where $V = \{1, \ldots, p\}$ indexes the random vector
$X = (X_1, \ldots, X_p)$ and the edges $E(G^*)$ encode the following Markov property: for a set of nodes A, B, and S, if
S separates A and B, then $X_A \perp\!\!\!\perp X_B \mid X_S$.

The Markov property outlined above is referred to as the global Markov property. Undirected graphical
models are also referred to as Markov random fields or Markov networks in the literature. When the joint
probability distribution $P_X$ is non-degenerate, i.e., $P_X > 0$, the Markov properties in Definition 2.1 are
equivalent to the pairwise and local Markov properties:

• Pairwise Markov property: For all $(i, j) \notin E$, $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i, j\}}$.

• Local Markov property: For all $i \in V$, $X_i \perp\!\!\!\perp X_{V \setminus \{ne_G(i) \cup \{i\}\}} \mid X_{ne_G(i)}$.

In this paper, we always assume $P_X > 0$ and say $P_X$ is Markov on G to reflect the Markov properties.
Examples of conditional independence relations conveyed by a probability distribution defined on the graph
in Figure 3(d) are $X_1 \perp\!\!\!\perp X_6 \mid \{X_2, X_4\}$ and $X_4 \perp\!\!\!\perp X_6 \mid \{X_2, X_5, X_8\}$.

2.3 Undirected Graphical Model Selection (UGMS)

Definition 2.2 (UGMS). The undirected graphical model selection (UGMS) problem is to estimate a graph 
G* such that the joint probability distribution Px is Markov on G* , but not Markov on any subgraph of G* . 

The last statement in Definition 2.2 is important, since, if $P_X$ is Markov on G*, then it is also Markov on
any graph that contains G*. For example, all probability distributions are Markov on the complete graph. 
Thus, the UGMS problem is to find the minimal graph that captures the Markov properties associated with 
a joint probability distribution. In the literature, this is also known as finding the minimal I-map. 

Let $\Psi$ be an abstract UGMS algorithm that takes as inputs a set of n i.i.d. observations
$X^n = \{X^{(1)}, \ldots, X^{(n)}\}$ drawn from $P_X$ and a regularization parameter $\lambda_n$. The output of $\Psi$ is a graph
$G_n$, where $\lambda_n$ controls the number of edges estimated in $G_n$. Note the dependence of the regularization
parameter on n. We assume $\Psi$ is consistent, which is formalized in the following assumption.

Assumption 1. There exists a $\lambda_n$ for which $P(G_n = G^*) \to 1$ as $n \to \infty$, where $G_n = \Psi(X^n, \lambda_n)$.

We give examples of $\Psi$ in Appendix B. Assumption 1 also takes into account the high-dimensional case
when p depends on n in such a way that $p \to \infty$ and $n \to \infty$.

2.4 Junction Trees 

Junction trees [39] are used extensively for efficiently solving various graph related problems; see [1] for some
examples. Reference [24] shows how junction trees can be used for exact inference (computing marginal
distribution given a joint distribution) over graphical models. We use junction trees as a tool for decomposing 
the UGMS problem into multiple subproblems. 

Definition 2.3 (Junction tree). For an undirected graph $G = (V, E(G))$, a junction tree
$\mathcal{J} = (\mathcal{C}, E(\mathcal{J}))$ is a graph over clusters of nodes in V such that

(i) Each node in V is associated with at least one cluster in $\mathcal{C}$.

(ii) For every edge $(i, j) \in E(G)$, there exists a cluster $C_k \in \mathcal{C}$ such that $i, j \in C_k$.

(iii) $\mathcal{J}$ satisfies the running intersection property: For all clusters $C_u$, $C_v$, and $C_w$ such that $C_w$ separates
$C_u$ and $C_v$ in the tree defined by $E(\mathcal{J})$, $C_u \cap C_v \subseteq C_w$.

Figure 3: (a) An undirected graph. (b) Not a valid junction tree since {1, 2} separates {1, 3} and {3, 4}, but $3 \notin \{1, 2\}$.
(c) A valid junction tree for the graph in (a). (d) A grid graph. (e) Junction tree representation of (d).



The first property in Definition 2.3 says that all nodes must be mapped to at least one cluster of the
junction tree. The second property states that each edge of the original graph must be contained within
a cluster. The third property, known as the running intersection property, is the most important since it
restricts the clusters and the trees that can be formed. For example [JS], consider the graph in Figure 3(a).
By simply clustering the nodes over edges, as done in Figure 3(b), we cannot get a valid junction tree. By
making appropriate clusters of size three, we get a valid junction tree in Figure 3(c). In other words, the
running intersection property says that for two clusters with a common node, all the clusters on the path
between the two clusters must contain that common node.
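
The three conditions can be checked mechanically. The sketch below is a hypothetical helper, not part of the paper. It tests the running intersection property in its equivalent form: for every node, the clusters containing that node must form a connected subtree. Because the figure is not reproduced here, the example assumes that the graph in Figure 3(a) is the 4-cycle with edges (1,2), (1,3), (2,4), (3,4), which is consistent with the caption of Figure 3.

```python
def is_junction_tree(vertices, edges, clusters, tree_edges):
    """Check conditions (i)-(iii) for clusters joined by tree_edges.

    clusters: list of sets; tree_edges: list of (u, v) index pairs into clusters.
    Assumes tree_edges already forms a tree over the clusters.
    """
    clusters = [set(c) for c in clusters]
    # (i) every node lies in at least one cluster
    if any(not any(v in c for c in clusters) for v in vertices):
        return False
    # (ii) every edge of the graph lies inside some cluster
    if any(not any({i, j} <= c for c in clusters) for i, j in edges):
        return False
    # (iii) running intersection: clusters containing v are connected in the tree
    for v in vertices:
        holders = {k for k, c in enumerate(clusters) if v in c}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            a = stack.pop()
            for x, y in tree_edges:
                if a in (x, y):
                    b = y if x == a else x
                    if b in holders and b not in seen:
                        seen.add(b)
                        stack.append(b)
        if seen != holders:
            return False
    return True

# Assumed version of Figure 3(a): a 4-cycle.
V = [1, 2, 3, 4]
E = [(1, 2), (1, 3), (2, 4), (3, 4)]

# Clusters over edges, chained as {1,3} - {1,2} - {2,4} - {3,4}: node 3 breaks (iii).
edge_clusters_ok = is_junction_tree(V, E, [{1, 3}, {1, 2}, {2, 4}, {3, 4}],
                                    [(0, 1), (1, 2), (2, 3)])
# Clusters of size three, {1,2,3} - {2,3,4}: all three conditions hold.
triple_clusters_ok = is_junction_tree(V, E, [{1, 2, 3}, {2, 3, 4}], [(0, 1)])
```

Under this assumption the edge-based clustering fails only condition (iii), which matches the explanation given in the caption of Figure 3(b).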



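Proposition 2.1 ([SS]). Let $\mathcal{J} = (\mathcal{C}, E(\mathcal{J}))$ be a junction tree of the graph G. Let $S_{uv} = C_u \cap C_v$. For
each $(C_u, C_v) \in E(\mathcal{J})$, we have the following properties:

1. $S_{uv} \neq \emptyset$.

2. $S_{uv}$ separates $C_u \setminus S_{uv}$ and $C_v \setminus S_{uv}$.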

The sets of nodes $S_{uv}$ on the edges are called the separators of the junction tree. Proposition 2.1 says
that all clusters connected by an edge in the junction tree have at least one common node and the common
nodes separate nodes in each cluster. For example, consider the junction tree in Figure 3(e) of the graph in
Figure 3(d). We can infer that 1 and 5 are separated by 2 and 4. Similarly, we can also infer that 4 and 6
are separated by 2, 5, and 8. It is clear that if a graphical model is defined on the graph, then the separators
can be used to easily define conditional independence relationships. For example, using Figure 3(e), we
can conclude that $X_1 \perp\!\!\!\perp X_5$ given $X_2$ and $X_4$. As we will see in later Sections, Proposition 2.1 allows the
decomposition of UGMS into multiple subproblems over clusters and subsets of the separators in a junction
tree.



3. Paper Organization 

The rest of the paper is organized as follows: 

• Section 4 shows how junction trees can be represented as region graphs and outlines an algorithm for
constructing a region graph from a junction tree.

• Section 5 shows how the region graphs can be used to apply a UGMS algorithm to the clusters and
separators of a junction tree.

• Section 6 presents our main framework for using junction trees for UGMS. In particular, we show how
the methods in Sections 4 and 5 can be used iteratively to estimate a graph.

• Section 7 reviews the PC-Algorithm, which we use to study the theoretical properties of the junction
tree framework.

• Section 8 presents theoretical results on the sample complexity of learning graphical models using
the junction tree framework. We also highlight advantages of using the junction tree framework as
summarized in Section 1.2.

• Section 9 presents numerical simulations to highlight the advantages of using junction trees for UGMS
in practice.

• Section 10 summarizes the paper and outlines some future work.

4. Overview of Region Graphs 

In this Section, we show how junction trees can be represented as region graphs. As we will see in Section 5,
region graphs allow us to easily decompose the UGMS problem into multiple subproblems. There are many
different types of region graphs and we refer the readers to [SIJ for a comprehensive discussion about region
graphs and how they are useful for characterizing graphical models. The region graph we present in this
Section differs slightly from the standard definition of region graphs. This is mainly because our goal is to 
estimate edges, while the classical region graphs defined in the literature are used for computations over 
graphical models. 

A region is a collection of nodes, which in our context can be the clusters of the junction tree, separators
of the junction tree, or subsets of the separators. A region graph $\mathcal{G} = (\mathcal{R}, \vec{E}(\mathcal{G}))$ is a directed graph where the
vertices are regions and the edges represent directed edges from one region to another. We use the notation
$\vec{E}(\cdot)$ to emphasize that region graphs contain directed edges. A description of region graphs is given as
follows.

• The set $\vec{E}(\mathcal{G})$ contains directed edges so that if $(R, S) \in \vec{E}(\mathcal{G})$, there exists a directed edge from region
R to region S.

• Whenever $R \to S$, then $S \subseteq R$.

Algorithm 1 outlines an algorithm to construct region graphs given a junction tree representation of some
graph H. We associate a label $l$ with every region in $\mathcal{R}$ and group regions with the same label to partition
$\mathcal{R}$ into L groups $\mathcal{R}_1, \ldots, \mathcal{R}_L$. In Algorithm 1, we initialize $\mathcal{R}_1$ and $\mathcal{R}_2$ to be the clusters and separators of
a junction tree $\mathcal{J}$, respectively, and then iteratively find $\mathcal{R}_3, \ldots, \mathcal{R}_L$ by computing all possible intersections
of regions with the same label. The edges in $\vec{E}(\mathcal{G})$ are only drawn from a region in $\mathcal{R}_l$ to a region in $\mathcal{R}_{l+1}$.
Figure 4(c) shows an example of a region graph computed using the junction tree in Figure 4(b).

Algorithm 1: Constructing region graphs
Input: A junction tree $\mathcal{J} = (\mathcal{C}, E(\mathcal{J}))$ of a graph H.
Output: A region graph $\mathcal{G} = (\mathcal{R}, \vec{E}(\mathcal{G}))$.

1 $\mathcal{R}_1 = \mathcal{C}$, where $\mathcal{C}$ are the clusters of the junction tree $\mathcal{J}$.

2 Let $\mathcal{R}_2$ be all the separators of $\mathcal{J}$, i.e., $\mathcal{R}_2 = \{S_{uv} = C_u \cap C_v : (C_u, C_v) \in E(\mathcal{J})\}$.

3 To find $\mathcal{R}_3$, find all possible pairwise intersections of regions in $\mathcal{R}_2$. Add all intersecting regions with
cardinality greater than one to $\mathcal{R}_3$.

4 Repeat previous step to construct $\mathcal{R}_4, \ldots, \mathcal{R}_L$ until there are no more intersecting regions of
cardinality greater than one.

5 For $R \in \mathcal{R}_l$ and $S \in \mathcal{R}_{l+1}$, add the edge $(R, S)$ to $\vec{E}(\mathcal{G})$ if $S \subseteq R$.

6 Let $\mathcal{R} = \{\mathcal{R}_1, \ldots, \mathcal{R}_L\}$.
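
The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reading, not the authors' implementation. The function name and data layout are our own, and each row stores a region only once. The clusters below are the ones that appear in the examples of Section 5. The tree edges are an assumption, chosen so that the separators match the regions named in those examples.

```python
from itertools import combinations

def region_graph(clusters, tree_edges):
    """Sketch of Algorithm 1. Returns (levels, edges).

    levels[l] is the set of regions (frozensets) with label l + 1;
    edges holds pairs ((l, R), (l + 1, S)) with S a subset of R.
    """
    clusters = [frozenset(c) for c in clusters]
    levels = [set(clusters)]                                            # step 1
    levels.append({clusters[u] & clusters[v] for u, v in tree_edges})   # step 2
    while True:                                                         # steps 3-4
        nxt = {a & b for a, b in combinations(levels[-1], 2) if len(a & b) > 1}
        if not nxt:
            break
        levels.append(nxt)
    edges = {((l, R), (l + 1, S))                                       # step 5
             for l in range(len(levels) - 1)
             for R in levels[l] for S in levels[l + 1] if S <= R}
    return levels, edges                                                # step 6

clusters = [{1, 3, 5}, {2, 3, 5, 6}, {2, 3, 4, 6}, {3, 4, 6, 7}, {3, 5, 6, 8}, {5, 6, 8, 9}]
tree_edges = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]   # assumed junction tree edges
levels, edges = region_graph(clusters, tree_edges)

# Children of the cluster {2, 3, 4, 6}: the regions it points to in the next row.
kids = {S for (parent, (_, S)) in edges if parent == (0, frozenset({2, 3, 4, 6}))}
```

The loop terminates because every new region is the intersection of two distinct regions, so the largest region in a row is strictly smaller than the largest region in the row above.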



Remark 4.1. For every junction tree, Algorithm 1 outputs a unique region graph. The junction tree only
characterizes the relationship between the clusters in a junction tree. A region graph extends the junction
tree representation to characterize the relationships between the clusters as well as the separators. For
example, in Figure 4(c), the region {5, 6} is in the third row and is a subset of two separators of the junction
tree. Thus, the only difference between the region graph and the junction tree is the additional set of regions
introduced in $\mathcal{R}_3, \ldots, \mathcal{R}_L$.

Figure 4: (a) An example of H. (b) A junction tree representation of H. (c) A region graph representation
of (b) computed using Algorithm 1.

Remark 4.2. Using methods in [?5], we can always construct junction trees such that the region graph
will only have two labels. However, in this case, the size of the regions or clusters may be too large. This
may not be desirable since the computational complexity of applying UGMS algorithms to region graphs, as
shown in Section 5, depends on the size of the regions.

Remark 4.3. From the construction in Algorithm 1, $\mathcal{R}$ may have two or more regions that are the same
but have different labels. For example, in Figure 4(c), the region {3, 5} is in both $\mathcal{R}_2$ and $\mathcal{R}_3$. We can avoid
this situation by removing {3, 5} from $\mathcal{R}_2$ and adding an edge from the region {1, 3, 5} in $\mathcal{R}_1$ to the region
{3, 5} in $\mathcal{R}_3$. For notational simplicity and for the purpose of illustration, we allow for duplicate regions.
This does not change the theory or the algorithms that we develop.



5. Applying UGMS to Region Graphs 

Before presenting our framework for decomposing UGMS into multiple subproblems, we first show how 
UGMS algorithms can be applied to estimate a subset of edges in a region of a region graph. In particular, 
for a region graph $\mathcal{G} = (\mathcal{R}, \vec{E}(\mathcal{G}))$, we want to identify a set of edges in the induced subgraph H[R] that can
be estimated by applying a UGMS algorithm to either R or a set of vertices that contains R. With this goal 
in mind, define the children ch{R) of a region R as follows: 



$$\text{Children:} \quad ch(R) = \{S : (R, S) \in \vec{E}(\mathcal{G})\}. \qquad (1)$$

We say R connects to S if $(R, S) \in \vec{E}(\mathcal{G})$. Thus, the children in (1) contain all regions that R connects to.
For example, in Figure 4(c),

$ch(\{2, 3, 4, 6\}) = \{\{2, 3, 6\}, \{3, 4, 6\}\}$.

If there exists a directed path from S to R, we say S is an ancestor of R. The set of all ancestors of R is
denoted by an(R). For example, in Figure 4(c),

$an(\{5, 6, 8, 9\}) = \emptyset$,

$an(\{3, 5, 6\}) = \{\{3, 5, 6, 8\}, \{2, 3, 5, 6\}\}$, and
$an(\{3, 6\}) = \{\{3, 5, 6\}, \{2, 3, 6\}, \{3, 4, 6\}, \{2, 3, 5, 6\}, \{2, 3, 4, 6\}, \{3, 4, 6, 7\}, \{3, 5, 6, 8\}\}$.

The notation $\overline{R}$ takes the union of all regions in an(R) and R so that

$$\overline{R} = \bigcup_{S \in \{an(R), R\}} S. \qquad (2)$$

Thus, $\overline{R}$ contains the union of all clusters in the junction tree that contain R. An illustration of some of
the notations defined on region graphs is shown in Figure 5. Using ch(R), define the subgraph $H'_R$ as²

$$H'_R = H[R] \setminus \{\cup_{S \in ch(R)} K_S\}, \qquad (3)$$

² For graphs $G_1$ and $G_2$, $E(G_1 \setminus G_2) = E(G_1) \setminus E(G_2)$ and $E(G_1 \cup G_2) = E(G_1) \cup E(G_2)$.



Algorithm 2: UGMS over regions in a region graph

1: Input: Region graph $\mathcal{G} = (\mathcal{R}, \vec{E}(\mathcal{G}))$, a region R, observations $X^n$, and a UGMS algorithm $\Psi$.

2: Compute $H'_R$ using (3) and $\overline{R}$ using (2).

3: Apply $\Psi$ to $X^n_{\overline{R}}$ to estimate edges in $H'_R$. See Appendix B for examples.

4: Return the estimated edges $\widehat{E}_R$.

Figure 5: Notations defined on region graphs. The children ch(R) are the set of regions that R connects to.
The ancestors an(R) are all the regions that have a directed path to the region R. The set $\overline{R}$ takes the
union of all regions in an(R) and R.



where H[R] is the induced subgraph that contains all edges in H over the region R and $K_S$ is the complete
graph over S. In words, $H'_R$ is computed by removing all edges from H[R] that are contained in another
separator. For example, in Figure 4(c), when R = {5, 6, 8}, $E(H'_R) = \{(5, 8), (6, 8)\}$. The subgraph $H'_R$ is
important since it identifies the edges that can be estimated when applying a UGMS algorithm to the set of
vertices $\overline{R}$.
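
The definitions in (1), (2), and (3) can be traced on the running example with a short Python sketch. The code is illustrative only: the function names are ours, the rows of the region graph are hardcoded from the regions named in the text, and because the graph H of Figure 4(a) is not reproduced here, the last example assumes that H contains all three edges over {5, 6, 8}.

```python
ROWS = [
    [{1, 3, 5}, {2, 3, 5, 6}, {2, 3, 4, 6}, {3, 4, 6, 7}, {3, 5, 6, 8}, {5, 6, 8, 9}],  # clusters
    [{3, 5}, {2, 3, 6}, {3, 4, 6}, {3, 5, 6}, {5, 6, 8}],                              # separators
    [{3, 5}, {3, 6}, {5, 6}],                                                          # intersections
]
ROWS = [[frozenset(r) for r in row] for row in ROWS]

def children(l, R):
    """ch(R) in (1): regions in the row below row l that are subsets of R."""
    if l + 1 >= len(ROWS):
        return []
    return [S for S in ROWS[l + 1] if S <= R]

def ancestors(l, R):
    """an(R): every (row, region) pair with a directed path to R."""
    found, frontier = set(), {(l, R)}
    while frontier:
        parents = {(k - 1, P) for k, S in frontier if k > 0
                   for P in ROWS[k - 1] if S <= P}
        found |= parents
        frontier = parents
    return found

def r_bar(l, R):
    """R-bar in (2): union of R with all of its ancestors."""
    out = set(R)
    for _, S in ancestors(l, R):
        out |= S
    return out

def h_prime_edges(l, R, H_edges):
    """E(H'_R) in (3): edges of H[R] that are not inside any child of R."""
    kids = children(l, R)
    return {e for e in H_edges if e <= R and not any(e <= S for S in kids)}

# Assumption: H contains every edge over {5, 6, 8}.
H_568 = {frozenset(e) for e in [(5, 6), (5, 8), (6, 8)]}
```

For instance, `r_bar(2, frozenset({3, 6}))` collects the seven ancestors listed in the text and returns {2, 3, 4, 5, 6, 7, 8}, while `h_prime_edges(1, frozenset({5, 6, 8}), H_568)` drops the edge (5, 6) because it lies inside the child region {5, 6}.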

Proposition 5.1. Suppose $E(G^*) \subseteq E(H)$. All edges in $H'_R$ can be estimated by solving a UGMS problem
over the vertices $\overline{R}$.

Proof. See Appendix C. □



Proposition 5.1 says that all edges in $H'_R$ can be estimated by applying a UGMS algorithm to the set of
vertices $\overline{R}$. The intuition behind the result is that only those edges in the region R can be estimated whose
Markov properties can be deduced using the vertices in $\overline{R}$. Moreover, the edges not estimated in H[R] share
an edge with another region that does not contain all the vertices in R. Algorithm 2 summarizes the steps
involved in estimating $H'_R$ using the UGMS algorithm $\Psi$ defined in Section 2.3. Some examples on how to
use Algorithm 2 to estimate some edges of the graph in Figure 4(a) using the region graph in Figure 4(c)
are described as follows.

1. Let R = {1, 3, 5}. This region only connects to {3, 5}. This means that all edges, except the edge (3, 5)
in H[R], can be estimated by applying $\Psi$ to R.

2. Let R = {3, 5, 6}. The children of this region are {3, 5}, {5, 6}, and {3, 6}. This means that $H'_R = \emptyset$,
i.e., no edge over H[R] can be estimated by applying $\Psi$ to {3, 5, 6}.

3. Let R = {3, 4, 6}. This region only connects to {3, 6}. Thus, all edges except (3, 6) can be estimated.
The regions {2, 3, 4, 6} and {3, 4, 6, 7} connect to R, so $\Psi$ needs to be applied to $\overline{R} = \{2, 3, 4, 6, 7\}$.



6. UGMS Using Junction Trees: A General Framework 

In this Section, we present the main junction tree framework for UGMS using the results from Sections 4 and 5.
Section 6.1 presents the junction tree framework. Section 6.2 discusses the computational complexity of the
framework. Section 6.3 highlights the advantages of using junction trees for UGMS using some examples.
We refer to Table 1 for a summary of all the notations that we use in this Section.

Notation                                                            Description
$G^* = (V, E(G^*))$                                                 Unknown graph that we want to estimate.
$H$                                                                 Known graph such that $E(G^*) \subseteq E(H)$.
$\mathcal{G} = (\mathcal{R}, \vec{E}(\mathcal{G}))$                 Region graph of $\mathcal{J}$ constructed using Algorithm 1.
$\mathcal{R} = (\mathcal{R}_1, \ldots, \mathcal{R}_L)$              Partitioning of the regions in $\mathcal{R}$ into L labels.
$\overline{R}$                                                      The set of vertices used when applying $\Psi$ to estimate edges over R.
$H'_R$                                                              Edges in H[R] that can be estimated using Algorithm 2. See (3).

Table 1: A summary of some notations.



6.1 Description of Framework 

Recall that Algorithm 2 shows that to estimate a subset of edges in H[R], where R is a region in the region
graph $\mathcal{G}$, the UGMS algorithm $\Psi$ in Assumption 1 needs to be applied to the set $\overline{R}$ defined in (2). Given
this result, a straightforward approach to decomposing the UGMS problem is to apply Algorithm 2 to each
region R and combine all the estimated edges. This will work since for any $R, S \in \mathcal{R}$ such that $R \neq S$,
$E(H'_R) \cap E(H'_S) = \emptyset$. This means that each application of Algorithm 2 estimates a different set of edges in
the graph. However, for some edges, this may require applying a UGMS algorithm to a large set of nodes.
For example, in Figure 4(c), when applying Algorithm 2 to R = {3, 6}, the UGMS algorithm needs to be
applied to $\overline{R} = \{2, 3, 4, 5, 6, 7, 8\}$, which is almost the full set of vertices. To reduce the problem size of the
subproblems, we apply Algorithms 1 and 2 in an iterative manner as outlined in Algorithm 3.

Figure 6: A high level overview of the junction tree framework for UGMS in Algorithm 3.

Figure 6 shows a high level description of Algorithm 3. We first find a junction tree and then a region
graph of the graph H using Algorithm 1. We then find the row in the region graph over which edges can
be estimated and apply Algorithm 2 to each region in that row. We note that when estimating edges over
a region, we use model selection algorithms to choose an appropriate regularization parameter to select the
number of edges to estimate. Next, all estimated edges are added to G and all edges that are estimated are
removed from H. Thus, H now represents all the edges that are left to be estimated and $G \cup H$ contains
all the edges in G*. We repeat the above steps on a new region graph computed using $G \cup H$ and stop the
algorithm when H is an empty graph.

An example illustrating the junction tree framework is shown in Figure 7. The region graph in Figure 7(b)
is constructed using the graph H in Figure 7(a). The true graph G* we want to estimate is shown in
Figure 1(a). The top and bottom in Figure 7(c) show the graphs G and H, respectively, after estimating all
the edges in $\mathcal{R}_1$ of Figure 7(b). The edges in G are represented by double lines to distinguish them from the
edges in H. Figure 7(d) shows the region graph of $G \cup H$. Figure 7(e) shows the updated G and H where
only the edges (4, 5) and (5, 6) are left to be estimated. This is done by applying Algorithm 2 to the regions



in $\mathcal{R}_2$ of Figure 7(f). Notice that we did not include the region {1, 2} in the last region graph since we know
all edges in this region have already been estimated. In general, if $E(H[R]) = \emptyset$ for any region R, we can
remove this region and thereby reduce the computational complexity of constructing region graphs.

Algorithm 3: Junction Tree Framework for UGMS

See Table 1 for some notations.

Step 1. Initialize G so that $E(G) = \emptyset$ and find the region graph $\mathcal{G}$ of H.

Step 2. Find the smallest $l$ such that there exists a region $R \in \mathcal{R}_l$ such that $E(H'_R) \neq \emptyset$.

Step 3. Apply Algorithm 2 to each region in $\mathcal{R}_l$.

Step 4. Add all estimated edges to G and remove edges from H that have been estimated. Now $H \cup G$
contains all the edges in G*.

Step 5. Compute a new junction tree and region graph $\mathcal{G}$ using the graph $G \cup H$.

Step 6. If $E(H) = \emptyset$, stop the algorithm, else go to Step 2.

Figure 7: Example to illustrate the junction tree framework in Section 6.1.

6.2 Computational Complexity 

In this Section, we discuss the computational complexity of the junction tree framework. It is difficult to 
write down a closed form expression since the computational complexity depends on the structure of the 
junction tree. Moreover, merging clusters in the junction tree can easily control the computations. With 
this in mind, the main aim in this Section is to show that the complexity of the framework is roughly the 
same as that of applying a standard UGMS algorithm. Consider the following observations. 

1. Computing H : Assuming no prior knowledge about H is given, this graph needs to be computed from 
the observations. This can be done using standard screening algorithms, such as those in [ISIIH], or 
by applying a UGMS algorithm with a regularization parameter that selects a larger number of edges 
(than that computed by using a standard UGMS algorithm). Thus, the complexity of computing H is 
roughly the same as that of applying a UGMS algorithm to all the vertices in the graph. 

2. Applying UGMS to regions: Recall from Algorithm 2 that we apply a UGMS algorithm to observations
   over R to estimate edges over the vertices R, where R is a region in a region graph representation of
   H. Since $|R| \le p$, it is clear that the complexity of Algorithm 2 is no greater than that of applying a UGMS
   algorithm to estimate all edges in the graph.

3. Computing junction trees: There are many different junction tree representations of a given graph. In 
the literature, an optimal junction tree is defined as having the minimal maximum width, where the 
width of a junction tree is the maximum size of the cluster. Finding an optimal junction tree is known 
to be computationally intractable [3]. However, several tractable algorithms have been proposed in 
the literature that compute a close to optimal junction tree in time at most 0{p^) [Sllll]- This time 
complexity is less than that of standard UGMS algorithms. 

It is clear that the complexity of all the intermediate steps in the framework is less than that of applying 
a standard UGMS algorithm. The overall complexity of the framework depends on the number of clusters in 
the junction tree and the size of the separators in the junction tree. The size of the separators in a junction 
tree can be controlled by merging clusters that share a large separator. This step can be done in linear time. 
Removing large separators also reduces the total number of clusters in a junction tree. In the worst case, 
if all the separators in H are too large, the junction tree will only have one cluster that contains all the 
vertices. In this case, using the junction tree framework will be no different than using a standard UGMS 
algorithm. 
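The merging step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `merge_large_separators`, the dictionary representation of clusters, and the threshold argument `max_sep` are assumptions made for the example, and the simple loop below makes no attempt to achieve the linear running time mentioned in the text.

```python
def merge_large_separators(clusters, tree_edges, max_sep):
    """Merge adjacent junction-tree clusters whose separator has more than max_sep vertices.

    clusters:   dict mapping a cluster id to a set of vertices (hypothetical representation).
    tree_edges: iterable of (id_a, id_b) pairs forming a tree over the cluster ids.
    Returns the merged clusters and the remaining tree edges.
    """
    clusters = {cid: set(vs) for cid, vs in clusters.items()}
    edges = {frozenset(e) for e in tree_edges}

    def separator(edge):
        a, b = tuple(edge)
        return clusters[a] & clusters[b]

    while True:
        # Pick any tree edge whose separator is too large.
        big = next((e for e in edges if len(separator(e)) > max_sep), None)
        if big is None:
            return clusters, sorted(tuple(sorted(e)) for e in edges)
        keep, drop = sorted(big)
        clusters[keep] |= clusters.pop(drop)
        edges.remove(big)
        # Re-attach the neighbours of the dropped cluster to the kept one.
        edges = {frozenset(keep if c == drop else c for c in e) for e in edges}
```

If every separator exceeds `max_sep`, the loop collapses the tree into a single cluster, which matches the worst case described above.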

6.3 Advantages of using Junction Trees and Region Graphs 

An alternative approach to estimating G* using H is to modify some current UGMS algorithms (see
Appendix B for some concrete examples). For example, neighborhood selection based algorithms first estimate
the neighborhood of each vertex and then combine all the estimated neighborhoods to construct an estimate
G of G* [7j j33j j54j.37j . Two ways in which these algorithms can be modified when given H are described as 
follows: 

1. A straightforward approach is to decompose the UGMS problem into p different subproblems of
   estimating the neighborhood of each vertex. The graph H can be used to restrict the estimated neighbors
   of each vertex to be subsets of the neighbors in H. For example, in Figure 7(a), the neighborhood of
   1 is estimated from the set {2, 3, 4, 5} and the neighborhood of 3 is estimated from the set {1, 4, 5, 6}.
   This approach can be compared to independently applying Algorithm 2 to each region in the region
   graph. For example, when using the region graph, the edge (1,4) can be estimated by applying a
   UGMS algorithm to {1,3,4,5}. In comparison, when not using region graphs, the edge (1,4) is
   estimated by applying a UGMS algorithm to {1,2,3,4,5}. In general, using region graphs results in
   smaller subproblems. A good example to illustrate this is the star graph in Figure 7(g). The junction
   tree framework only requires applying a UGMS algorithm to a pair of nodes. On the other hand,
   neighborhood selection needs to be applied to all the nodes to estimate the neighbors of the central
   node 1, which is connected to all other nodes.

2. An alternative approach is to estimate the neighbors of each vertex in an iterative manner. However,
   it is not clear what ordering should be chosen for the vertices. The region graph approach outlined in
   Section 6.1 leads to a natural choice for choosing which edges to estimate in the graph so as to reduce
   the problem size of subsequent subproblems. Moreover, iteratively applying neighborhood selection
   may still lead to large subproblems. For example, suppose the star graph in Figure 7(g) is in fact the
   true graph. In this case, using neighborhood selection always leads to applying UGMS to all the nodes
   in the graph.

From the above discussion, it is clear that using junction trees for UGMS leads to smaller subproblems
and a natural choice of an ordering for estimating edges in the graph. We will see in Section 8 that the
smaller subproblems lead to weaker conditions on the number of observations required for consistent graph
estimation. Moreover, our numerical simulations in Section 9 empirically show the advantages of using
junction trees over neighborhood selection based algorithms.
Algorithm 4: PC-Algorithm for UGMS: PC($\kappa$, $X^n$, H, L)

Inputs:
   $\kappa$: An integer that controls the computational complexity of PC.
   $X^n$: n i.i.d. observations.
   H: A graph that contains all the true edges G*.
   L: A graph that contains the edges that need to be estimated.
Output: A graph $\hat{G}$ that contains edges in L that are estimated to be in G*.

1 $\hat{G} \leftarrow L$
2 for each $k \in \{0, 1, \ldots, \kappa\}$ do
3    for each $(i,j) \in E(\hat{G})$ do
4       $S_{ij} \leftarrow$ Neighbors of i or j in H depending on which one has lower cardinality.
5       if $\exists\, S \subset S_{ij}$, $|S| = k$, s.t. $X_i \perp X_j \mid X_S$ (computed using $X^n$) then
6          Delete edge (i,j) from $\hat{G}$ and H.
7 Return $\hat{G}$.



7. PC-Algorithm for UGMS 

So far, we have presented the junction tree framework using an abstract undirected graphical model selection
(UGMS) algorithm. This shows that our framework can be used in conjunction with any UGMS algorithm. In
this Section, we review the PC-Algorithm, since we use it to analyze the junction tree framework in Section 8.
The PC-Algorithm was originally proposed in the literature for learning directed graphical models [53]. The
first stage of the PC-Algorithm, which we refer to as PC, estimates an undirected graph using conditional
independence tests. The second stage orients the edges in the undirected graph to estimate a directed
graph. We use the first stage of the PC-Algorithm for UGMS. Algorithm 4 outlines PC. Variants of the
PC- Algorithm for learning undirected graphical models have recently been analyzed in [BIS. The main 
property used in PC is the global Markov property of undirected graphical models, which states that if a set
of vertices S separates i and j, then $X_i \perp X_j \mid X_S$. As seen in Line 5 of Algorithm 4, PC deletes an edge (i, j)
if it identifies a conditional independence relationship. Some properties of PC are summarized as follows:

1. Parameter $\kappa$: PC iteratively searches for separators for an edge (i, j) by searching for separators of size
   $0, 1, \ldots, \kappa$. This is reflected in Line 2 of Algorithm 4. Theoretically, the algorithm can automatically
   stop after searching for all possible separators for each edge in the graph. However, this may not be
   computationally tractable, which is why $\kappa$ needs to be specified.

2. Conditional Independence Test: Line 5 of Algorithm 4 uses a conditional independence test to
   determine if an edge (i, j) is in the true graph. This makes PC extremely flexible since nonparametric
independence tests may be used, see [161I361F5B] for some examples. In this paper, for simplicity, we 
   only consider Gaussian graphical models. In this case, conditional independence can be tested using
   the conditional correlation coefficient defined as

   Conditional correlation coefficient: $\rho_{ij|S} = \dfrac{\Sigma_{i,j|S}}{\sqrt{\Sigma_{i,i|S}\,\Sigma_{j,j|S}}}$ , (4)

   where $P_X \sim \mathcal{N}(0, \Sigma)$, $\Sigma_{A,B}$ is the covariance matrix of $X_A$ and $X_B$, and $\Sigma_{A,B|S}$ is the conditional
   covariance defined by

   $\Sigma_{A,B|S} = \Sigma_{A,B} - \Sigma_{A,S}\,\Sigma_{S,S}^{-1}\,\Sigma_{S,B}$ . (5)

   Whenever $X_i \perp X_j \mid X_S$, then $\rho_{ij|S} = 0$. This motivates the following test for independence:

   Conditional Independence Test: $|\hat{\rho}_{ij|S}| < \lambda_n \implies X_i \perp X_j \mid X_S$ , (6)

   where $\hat{\rho}_{ij|S}$ is computed using the empirical covariance matrix from the observations $X^n$. The
   regularization parameter $\lambda_n$ controls the number of edges estimated in $\hat{G}$.
3. The graphs H and L: Recall that H contains all the edges in G*. The graph L contains edges
   that need to be estimated since, as seen in Algorithm 2, we apply UGMS to only certain parts of the
   graph instead of the whole graph. As an example, to estimate edges in a region R of a region graph
   representation of H, we apply Algorithm 4 as follows:

   $\hat{G}_R = \mathrm{PC}(\eta, X^n, H, H'_R)$ , (7)

   where $H'_R$ is defined in (3). Notice that we do not use R in (7). This is because Line 4 of Algorithm 4
   automatically finds the set of vertices to apply the PC algorithm to. Alternatively, we can apply
   Algorithm 4 using R as follows:

   $\hat{G}_R = \mathrm{PC}(\eta, X^n, K_R, H'_R)$ , (8)

   where $K_R$ is the complete graph over R.

4. The set $S_{ij}$: An important step in Algorithm 4 is specifying the set $S_{ij}$ in Line 4 to restrict the
   search space for finding separators for an edge (i, j). This step significantly reduces the computational
complexity of PC and differentiates PC from the first stage of the SGS- Algorithm [?3], which specifies 

   $S_{ij} = V \setminus \{i, j\}$.
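As an illustration of the properties listed above, the following is a minimal Python sketch of PC for Gaussian data using the thresholded conditional correlation test in (4)-(6). It is a sketch under stated assumptions rather than the authors' code: the function names `partial_corr`, `neighbors`, and `pc_ugms`, the representation of graphs as sets of `frozenset({i, j})` edges, and the single threshold argument `lam` are all choices made for this example.

```python
from itertools import combinations

import numpy as np


def partial_corr(S, i, j, cond):
    """Conditional correlation coefficient (4), via the conditional covariance (5)."""
    idx, cond = [i, j], list(cond)
    A = S[np.ix_(idx, idx)]
    if cond:
        A = A - S[np.ix_(idx, cond)] @ np.linalg.solve(
            S[np.ix_(cond, cond)], S[np.ix_(cond, idx)])
    return A[0, 1] / np.sqrt(A[0, 0] * A[1, 1])


def neighbors(edges, v):
    """Neighbors of vertex v in a graph given as a set of frozenset edges."""
    return {u for e in edges if v in e for u in e if u != v}


def pc_ugms(kappa, X, H, L, lam):
    """Sketch of the PC stage: X is an (n, p) data array, H and L are edge sets."""
    S = np.cov(X, rowvar=False)          # empirical covariance matrix
    H, G = set(H), set(L)                # work on copies; G starts as L (Line 1)
    for k in range(kappa + 1):           # separator sizes 0, 1, ..., kappa (Line 2)
        for edge in sorted(G, key=sorted):
            i, j = sorted(edge)
            # Line 4: the smaller of the two neighborhoods in H.
            cands = [neighbors(H, i) - {j}, neighbors(H, j) - {i}]
            S_ij = sorted(min(cands, key=len))
            # Line 5: test (6) over all separators of size k inside S_ij.
            if any(abs(partial_corr(S, i, j, sep)) < lam
                   for sep in combinations(S_ij, k)):
                G.discard(edge)
                H.discard(edge)
    return G
```

Calling `pc_ugms` with H and L both equal to the complete graph corresponds to the standard use of PC, while passing a sparser H restricts the separator search space as described in item 4.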

8. Theoretical Analysis of Junction Tree based PC 

We use the PC-Algorithm to analyze the junction tree based UGMS algorithm. Our main result, stated in
Theorem 8.1, shows that when using the PC-Algorithm with the junction tree framework, we can potentially
estimate the graph using fewer observations than what is required by the standard PC-Algorithm.
As we shall see in Theorem 8.1, the particular gain in performance depends on the structure of the graph.

Section 8.1 discusses the assumptions we place on the graphical model. Section 8.2 presents the main
theoretical result highlighting the advantages of using junction trees. We use standard asymptotic notation
so that $f(n) = \Omega(g(n))$ implies that there exists an N and a constant c such that for all $n > N$, $f(n) \ge c\,g(n)$.
For $f(n) = O(g(n))$, replace $\ge$ by $\le$.

8.1 Assumptions 

(A1) Gaussian graphical model: We assume $X = (X_1, \ldots, X_p) \sim P_X$, where $P_X$ is a multivariate normal
distribution with mean zero and covariance $\Sigma$. Further, $P_X$ is Markov on G* and not Markov on any
subgraph of G*. It is well known that this assumption translates into the fact that $\Sigma^{-1}_{ij} = 0$ if and only
if $(i,j) \notin G^*$ [41].

(A2) Faithfulness: If $X_i \perp X_j \mid X_S$, then i and j are separated by S. This assumption is important for
the PC algorithm to output the correct graph. Further, note that the Markov assumption is different
since it goes the other way: if i and j are separated by S, then $X_i \perp X_j \mid X_S$. Thus, when both (A1)
and (A2) hold, we have that $X_i \perp X_j \mid X_S$ for some S $\iff (i,j) \notin G^*$.

(A3) Separator Size $\eta$: For all $(i,j) \notin G^*$, there exists a subset of nodes $S \subseteq V \setminus \{i,j\}$, where $|S| \le \eta$,
such that S is a separator for i and j in G*. This assumption allows us to use $\kappa = \eta$ when using PC.

(A4) Conditional Correlation Coefficient $\rho_{ij|S}$ and $\Sigma$: Under (A3), we assume that $\rho_{ij|S}$ satisfies

$\sup\{|\rho_{ij|S}| : i, j \in V,\ S \subset V,\ |S| \le \eta\} \le M < 1$ , (9)

where M is a constant. Further, we assume that $\max_{i,S : |S| \le \eta} \Sigma_{i,i|S} \le L < \infty$.

(A5) High-Dimensionality: We assume that the number of vertices in the graph p scales with n so that
$p \to \infty$ as $n \to \infty$. Furthermore, both $\rho_{ij|S}$ and $\eta$ are assumed to be functions of n and p unless
mentioned otherwise.



'If S is the empty set, this means that there is no edge between i and j. 
(a) Structure of the graph in (A6) (b) Region graph of (a)

Figure 8: General Structure of the graph we use in showing the advantages of the junction tree framework. 

(A6) Structure of G*: Under (A3), we assume that there exist sets of vertices $V_1$, $V_2$, and T such that
T separates $V_1$ and $V_2$ in G* and $|T| < \eta$. Figure 8(a) shows the general structure of this assumption.
As we will see in the next Section, this structure of the graph will allow us to apply the junction tree
framework to the region graph representation in Figure 8(b).



8.2 Theoretical Result and Analysis 

Recall PC in Algorithm 4. Since we assume (A1), the conditional independence test in (6) can be used in
Line 5 of Algorithm 4. To analyze the junction tree framework, consider the following steps to construct $\hat{G}$
using PC when given n i.i.d. observations $X^n$:

Step 1. Compute H: Apply PC using a regularization parameter $\lambda_n^0$ such that

$H = \mathrm{PC}(|T|, X^n, K_V, K_V)$ , (10)

where $K_V$ is the complete graph over the nodes V. In the above equation, we apply PC to remove
all edges for which there exists a separator in G* of size less than or equal to $|T|$.

Step 2. Estimate a subset of edges over $V_1 \cup T$ and $V_2 \cup T$ using regularization parameters $\lambda_n^1$ and $\lambda_n^2$,
respectively, such that

$\hat{G}_{V_k} = \mathrm{PC}(\eta, X^n, H[V_k \cup T] \cup K_T, H'_{V_k \cup T})$ , for k = 1, 2, (11)

where $H'_{V_k \cup T} = H[V_k \cup T] \setminus K_T$ as defined in (3).

Step 3. Estimate edges over T using a regularization parameter $\lambda_n^T$:

$\hat{G}_T = \mathrm{PC}(\eta, X^n, H[T \cup \mathrm{ne}_{G^*}(T)], H[T])$ . (12)

Step 4. Final estimate is $\hat{G} = \hat{G}_{V_1} \cup \hat{G}_{V_2} \cup \hat{G}_T$.

For the region graph in Figure 8(b), Step 2 corresponds to applying PC to the regions $V_1 \cup T$ and
$V_2 \cup T$. Step 3 corresponds to applying PC to the region T and all neighbors of T in G*.
Although the neighbors of T are sufficient
to estimate all the edges in T, in general, depending on the graph, a smaller set of vertices is required to
estimate edges in T. The main result is stated using the following terms defined on the graphical model:



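$p_1 = |V_1| + |T|$ (13)
$p_2 = |V_2| + |T|$ (14)
$p_T = |T \cup \mathrm{ne}_{G^*}(T)|$ (15)
$\eta_T = |T|$ (16)
$\rho_0 = \inf\{|\rho_{ij|S}| : i, j \ \text{s.t.}\ |S| \le \eta_T \ \text{and}\ |\rho_{ij|S}| > 0\}$ (17)
$\rho_1 = \inf\{|\rho_{ij|S}| : i \in V_1,\ j \in V_1 \cup T \ \text{s.t.}\ (i,j) \in E(G^*),\ S \subseteq V_1 \cup T,\ |S| \le \eta\}$ (18)
$\rho_2 = \inf\{|\rho_{ij|S}| : i \in V_2,\ j \in V_2 \cup T \ \text{s.t.}\ (i,j) \in E(G^*),\ S \subseteq V_2 \cup T,\ |S| \le \eta\}$ (19)
$\rho_T = \inf\{|\rho_{ij|S}| : i, j \in T \ \text{s.t.}\ (i,j) \in E(G^*),\ S \subseteq T \cup \mathrm{ne}_{G^*}(T),\ \eta_T < |S| \le \eta\}$ . (20)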

The term $\rho_0$ is a measure of how hard it is to learn the graph H in Step 1 so that $E(G^*) \subseteq E(H)$ and
all edges that have a separator of size at most $|T|$ are deleted in H. The terms $\rho_1$ and $\rho_2$ are measures
of how hard it is to learn the edges in $G^*[V_1 \cup T] \setminus K_T$ and $G^*[V_2 \cup T] \setminus K_T$ (Step 2), respectively, given that
$E(G^*) \subseteq E(H)$. The term $\rho_T$ is a measure of how hard it is to learn the graph over the nodes T given that we
know the edges that connect $V_1$ to T and $V_2$ to T.

Theorem 8.1. Under Assumptions (A1)-(A6), there exists a conditional independence test such that if

$n = \Omega\left(\max\left\{\rho_0^{-2}\eta_T\log(p),\ \rho_1^{-2}\eta\log(p_1),\ \rho_2^{-2}\eta\log(p_2),\ \rho_T^{-2}\eta\log(p_T)\right\}\right)$ , (21)

then $P(\hat{G} \neq G^*) \to 0$ as $n \to \infty$.

Proof. See Appendix E. □



Remark 8.1 (Choice of Regularization Parameters). We use the conditional independence test in (6) that
thresholds the conditional correlation coefficient. From the proof in Appendix E, the thresholds, which we
refer to as the regularization parameters, are chosen as follows:

$\lambda_n^0 = O(\rho_0)$ and $\lambda_n^0 = \Omega\left(\sqrt{\eta_T\log(p)/n}\right)$ (22)

$\lambda_n^k = O(\rho_k)$ and $\lambda_n^k = \Omega\left(\sqrt{\eta\log(p_k)/n}\right)$ , k = 1, 2 (23)

$\lambda_n^T = O(\rho_T)$ and $\lambda_n^T = \Omega\left(\sqrt{\eta\log(p_T)/n}\right)$ . (24)

We clearly see that different regularization parameters are used to estimate different parts of the graph.

Remark 8.2 (Weaker Condition). If we do not use the junction tree based approach outlined in Steps 1-4 and
instead directly apply PC, the sufficient condition on the number of observations will be $n = \Omega(\rho_{\min}^{-2}\eta\log(p))$,
where

$\rho_{\min} := \inf\{|\rho_{ij|S}| : (i,j) \in E(G^*),\ |S| \le \eta\}$ . (25)

This result is proved in Appendix [P] using results from [I][3^. Since $\rho_{\min} \le \min\{\rho_0, \rho_1, \rho_2, \rho_T\}$, it is clear
that (21) is a weaker condition. The main reason for this difference is that the junction tree approach defines
an ordering on the edges to test if an edge belongs to the true graph. This ordering allows for a reduction
in separator search space (see $S_{ij}$ in Algorithm 4) for testing edges over the set T. Standard analysis of
PC assumes that the edges are tested randomly, in which case, the separator search space is always upper
bounded by the full set of nodes.

Remark 8.3 (Reduced Sample Complexity). Suppose $\eta$, $\rho_0$, and $\rho_T$ are constants and $\rho_1 < \rho_2$. In this
case, (21) reduces to

$n = \Omega\left(\max\left\{\log(p),\ \rho_1^{-2}\log(p_1),\ \rho_2^{-2}\log(p_2)\right\}\right)$ . (26)

If $\rho_1^{-2} = \Omega\left(\max\left\{\rho_2^{-2}\log(p_2)/\log(p_1),\ \log(p)\right\}\right)$, then (26) reduces to

$n = \Omega\left(\rho_1^{-2}\log(p_1)\right)$ . (27)

On the other hand, if we do not use junction trees, $n = \Omega(\rho_{\min}^{-2}\log(p))$, where $\rho_{\min} \le \rho_1$. Thus, if $p_1 \ll p$,
for example $p_1 = \log(p)$, then using the junction tree based PC requires fewer observations for
consistent UGMS. Informally, the above condition says that if the graph structure in (A6) is easy to identify,
$p_1 \ll p_2$, and the minimal conditional correlation coefficient over the true edges lies in the smaller cluster
(but not over the separator), the junction tree framework can accurately learn the graph using significantly
fewer observations.
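As a rough numerical illustration of the comparison between (27) and the condition without junction trees, consider the following worked example. The values $p = 1000$ and $p_1 = \log(p)$, the use of natural logarithms, and the equality $\rho_{\min} = \rho_1$ are assumptions chosen only for illustration, and the constants hidden in the $\Omega(\cdot)$ notation are ignored.

```latex
\[
p = 1000, \quad p_1 = \log(p) \approx 6.9 : \qquad
\frac{\rho_{\min}^{-2}\,\log(p)}{\rho_1^{-2}\,\log(p_1)}
  = \frac{\log(1000)}{\log(6.9)}
  \approx \frac{6.91}{1.93}
  \approx 3.6 .
\]
```

Under these assumptions, the leading term of the sufficient condition is smaller by a factor of about 3.6 when junction trees are used, and the factor grows with p.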
Remark 8.4 (Learning Weak Edges). We now analyze Theorem 8.1 to see how the partial correlations scale
for high-dimensional consistency. Under the assumption in Remark 8.3, it is easy to see that the minimal
partial correlation scales as $\Omega(\sqrt{\log(p_1)/n})$ when using junction trees and as $\Omega(\sqrt{\log(p)/n})$ when not using
junction trees. Thus, it is clear that when $p_1 \ll p$, it is possible to learn edges with weaker partial correlation
when using junction trees.

Remark 8.5 (Computational complexity). It is easy to see that the worst case computational complexity
of the PC-Algorithm is $O(p^{\eta+2})$ since there are $O(p^2)$ edges and testing for each edge requires a
search over at most $O(p^{\eta})$ separators. The worst case computational complexity of Steps 1-4 is roughly
$O\left(p^{|T|+2} + p_1^{\eta+2} + p_2^{\eta+2} + p_T^{\eta+2}\right)$. Under the conditions in Remark 8.3 and when $p_1 \ll p$, this complexity is
roughly $O(p^{\eta+2})$, which is the same as the standard PC-Algorithm. In practice, especially when the graph is
sparse, the computational complexity is much less than $O(p^{\eta+2})$ since the PC-Algorithm restricts the search
space for finding separators.

Remark 8.6 (Extensions). We have analyzed the junction tree framework assuming that the junction 
tree of H only has two clusters. Extending the analysis to junction trees with more than two clusters is 
trivial. Moreover, the analysis can be easily extended to graphical models that take values in some discrete
space. In this case, the conditional independence test can be done using the empirical conditional mutual 
information. Further, we presented theoretical results only for the PC-Algorithm. Similar results, under 
different assumptions, can be obtained when analyzing other UGMS algorithms. 

9. Numerical Simulations 

In this Section, we present numerical simulations that highlight the advantages of using the junction tree
framework for UGMS. Throughout this Section, we assume a Gaussian graphical model such that
$P_X \sim \mathcal{N}(0, \Theta^{-1})$ is Markov on G*. It is well known that this implies that $(i,j) \notin G^* \iff \Theta_{ij} = 0$ [42]. Some
algorithmic details used in the simulations are described as follows.

Computing H: We apply Algorithm 4 with a suitable value of $\kappa$ in such a way that the separator search
space $S_{ij}$ (see Line 4) is restricted to be small. We use the conditional partial correlation to test for conditional
independence and choose a separate threshold to test for each edge in the graph. The thresholds for the
conditional independence test are computed using 5-fold cross-validation. The computational complexity of
this step is roughly $O(p^2)$ since there are $O(p^2)$ edges to be tested. Note that this method for computing H is
equivalent to Step 1 in Section 8.2 with $|T| = \kappa$. Finally, we note that the above method does not guarantee
that all edges in G* will be included in H. This can result in false edges being included in the junction tree
estimated graphs. To avoid this situation, once a graph estimate $\hat{G}$ has been computed using the junction
tree based UGMS algorithm, we apply conditional independence tests again to prune the estimated edge set.

Computing the junction tree: We use standard algorithms in the literature for computing close to
optimal junction trees. Once the junction tree is computed, we merge clusters so that the maximum size of
the separator is at most $\kappa + 1$, where $\kappa$ is the parameter used when computing the graph H.
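The following sketch shows one generic way to obtain junction tree clusters with a greedy min-fill elimination ordering. It is an assumption-laden illustration rather than the implementation used in the simulations: the function name `greedy_fill_in_clusters` and the adjacency-dictionary input are hypothetical, vertex labels are assumed to be sortable, and the sketch returns only the maximal elimination cliques without connecting them into a tree.

```python
from itertools import combinations


def greedy_fill_in_clusters(adj):
    """Return the maximal elimination cliques found by a greedy min-fill ordering.

    adj: dict mapping each vertex to the set of its neighbours (undirected graph).
    """
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    cliques = []
    while adj:
        # Number of fill-in edges that eliminating each vertex would add.
        cost = {v: sum(b not in adj[a] for a, b in combinations(sorted(adj[v]), 2))
                for v in adj}
        v = min(sorted(adj), key=cost.get)        # ties broken by smallest label
        nbrs = adj.pop(v)
        for a, b in combinations(nbrs, 2):        # connect the neighbours of v
            adj[a].add(b)
            adj[b].add(a)
        for u in nbrs:
            adj[u].discard(v)
        cliques.append(nbrs | {v})
    # Keep only maximal cliques (drop subsets and repeated copies).
    return [c for i, c in enumerate(cliques)
            if not any(c < d or (c == d and j < i) for j, d in enumerate(cliques))]
```

For a four-cycle on vertices 1 to 4, one fill-in edge is added and the sketch returns the two clusters {1, 2, 4} and {2, 3, 4}, which share the separator {2, 4}.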

UGMS Algorithms: We apply the junction tree framework in conjunction with graphical Lasso (gL) [5], 
neighborhood selection using Lasso (nL) [S^, and the PC- Algorithm (PC) [43]. See Appendix [BJ for a review 
of gL and nL and Algorithm 4 for PC. When using nL, we use the intersection rule to combine neighborhood
estimates. Further, we use the adaptive Lasso [57] for finding neighbors of a vertex since this is known to
give superior results for variable selection [46].

Choosing Regularization Parameters: An important step when applying UGMS algorithms is to choose
a suitable regularization parameter. It is now well known that classical methods, such as cross-validation
and information criterion based methods, tend to choose a much larger number of edges when compared to
an oracle estimator for high-dimensional problems [28, 32]. Several alternative methods have been proposed
in the literature; see for example stability selection [28, 32] and extended Bayesian information (EBIC)
criterion [HIIT^- In all our simulations, we use EBIC since it is much faster than stability based methods 



when the distribution is Gaussian. EBIC selects a regularization parameter $\hat{\lambda}_n$ as follows:

$\hat{\lambda}_n = \arg\min_{\lambda_n > 0} \left\{ -n\left[\log\det\hat{\Theta}_{\lambda_n} - \mathrm{trace}(\hat{S}\hat{\Theta}_{\lambda_n})\right] + |E(\hat{G}_{\lambda_n})|\log n + 4\gamma|E(\hat{G}_{\lambda_n})|\log p \right\}$ , (28)

where $\hat{\Theta}_{\lambda_n}$ is the estimate of the inverse covariance matrix, $\hat{S}$ is the empirical covariance matrix, and
$|E(\hat{G}_{\lambda_n})|$ is the number of edges in the estimated graph. The estimate $\hat{\lambda}_n$ depends on a parameter
$\gamma \in [0, 1]$ such that $\gamma = 0$ results in the BIC estimate and increasing $\gamma$ produces sparser graphs. The authors
in Reference [13] suggest that $\gamma = 0.5$ is a reasonable choice for high-dimensional problems. When solving
subproblems using Algorithm 2, the $\log p$ term is replaced by $\log|R|$, $\hat{\Theta}_{\lambda_n}$ is replaced by the inverse
covariance over the vertices R, and $|E(\hat{G}_{\lambda_n})|$ is replaced by the number of edges estimated from the graph $H'_R$.

We use the GreedyFillin heuristic. This is known to give good results with reasonable computational time [21].

Small subproblems: Whenever $|R|$ is small (less than 8 in our simulations), we independently test whether
each edge is in G* using hypothesis testing. This shows the application of using different algorithms to learn
different parts of the graph.

9.1 Results on Synthetic Graphs

We assume that $\Theta_{ii} = 1$ for all $i = 1, \ldots, p$. We refer to all edges connected to the first $p_1$ vertices as weak
edges and the rest of the edges are referred to as strong edges. The different types of synthetic graphical
models we study are described as follows:

• Chain (CH1 and CH2): $\Theta_{i,i+1} = \rho_1$ for $i = 1, \ldots, p_1 - 1$ (weak edges) and $\Theta_{i,i+1} = \rho_2$ for $i = p_1, \ldots, p - 1$
  (strong edges). For CH1, $\rho_1 = 0.15$ and $\rho_2 = 0.245$. For CH2, $\rho_1 = 0.075$ and $\rho_2 = 0.245$. Let
  $\Theta_{ij} = \Theta_{ji}$.

• Cycle (CY1 and CY2): $\Theta_{i,i+1} = \rho_1$ for $i = 1, \ldots, p_1 - 1$ (weak edges) and $\Theta_{i,i+1} = \rho_2$ for $i = p_1, \ldots, p - 1$
  (strong edges). In addition, $\Theta_{i,i+3} = \rho_1$ for $i = 1, \ldots, p_1 - 3$ and $\Theta_{i,i+3} = \rho_2$ for $i = p_1, p_1 + 1, \ldots, p - 3$.
  This introduces multiple cycles in the graph. For CY1, $\rho_1 = 0.15$ and $\rho_2 = 0.245$. For CY2, $\rho_1 = 0.075$
  and $\rho_2 = 0.245$.

• Hub (HB1 and HB2): For the first $p_1$ vertices, construct as many star graphs of size $d_1$ as possible.
  For the remaining vertices, construct star graphs of size $d_2$ (at most one may be of size less than $d_2$).
  The hub graph G* is constructed by taking a union of all star graphs. For $(i,j) \in G^*$ s.t. $i, j \le p_1$, let
  $\Theta_{ij} = 1/d_1$. For the remaining edges, let $\Theta_{ij} = 1/d_2$. For HB1, $d_1 = 8$ and $d_2 = 5$. For HB2, $d_1 = 12$
  and $d_2 = 5$.

• Neighborhood graph (NB1 and NB2): Randomly place vertices on the unit square at coordinates
  $y_1, \ldots, y_p$. Let $\Theta_{ij} = \rho_1$ with probability $(\sqrt{2\pi})^{-1}\exp(-4\|y_i - y_j\|_2)$, otherwise $\Theta_{ij} = 0$ for all
  $i, j \in \{1, \ldots, p_1\}$ such that $i > j$. For all $i, j \in \{p_1 + 1, \ldots, p\}$ such that $i > j$, $\Theta_{ij} = \rho_2$. For edges over
  the first $p_1$ vertices, delete edges so that each vertex is connected to at most $d_1$ other vertices. For the
  vertices $p_1 + 1, \ldots, p$, delete edges such that the neighborhood of each vertex is at most $d_2$. Finally,
  randomly add four edges from a vertex in $\{1, \ldots, p_1\}$ to a vertex in $\{p_1, p_1 + 1, \ldots, p\}$ such that for
  each such edge, $\Theta_{ij} = \rho_1$. We let $\rho_2 = 0.245$, $d_1 = 6$, and $d_2 = 4$. For NB1, $\rho_1 = 0.15$ and for NB2,
  $\rho_1 = 0.075$.
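As a concrete illustration of the Chain and Cycle models described above, the following sketch builds the inverse covariance matrix and draws Gaussian samples from it. The function names `chain_or_cycle_precision` and `sample_observations`, the `cycle` flag, and the seed argument are assumptions made for this example. The text does not specify $\Theta_{i,i+3}$ for $i = p_1 - 2$ and $i = p_1 - 1$, so the sketch leaves those entries at zero.

```python
import numpy as np


def chain_or_cycle_precision(p, p1, rho1, rho2, cycle=False):
    """Inverse covariance for the Chain model, or the Cycle model if cycle=True.

    Indices i below are 1-based, as in the text; rho1 is used for weak edges
    and rho2 for strong edges.
    """
    theta = np.eye(p)                               # Theta_ii = 1
    for offset in ([1, 3] if cycle else [1]):
        for i in range(1, p - offset + 1):          # i = 1, ..., p - offset
            if i <= p1 - offset:
                val = rho1                          # weak edge
            elif i >= p1:
                val = rho2                          # strong edge
            else:
                continue                            # not specified in the text
            theta[i - 1, i + offset - 1] = val
            theta[i + offset - 1, i - 1] = val      # Theta_ij = Theta_ji
    return theta


def sample_observations(theta, n, seed=0):
    """Draw n i.i.d. observations from N(0, Theta^{-1})."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(len(theta)), np.linalg.inv(theta), size=n)
```

For example, `chain_or_cycle_precision(100, 20, 0.15, 0.245)` corresponds to CH1 and `chain_or_cycle_precision(100, 20, 0.075, 0.245, cycle=True)` corresponds to CY2. Both matrices are strictly diagonally dominant, so they are valid inverse covariance matrices.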

Notice that the parameters associated with the weak edges are lower than the parameters associated with
the strong edges. Some comments regarding notation and usage of various algorithms are given as follows.

• The junction tree versions of the UGMS algorithms are denoted by JgL, JPC, and JnL.

• We use EBIC with $\gamma = 0.5$ to choose regularization parameters when estimating graphs using JgL and
  JPC. To objectively compare JgL (JPC) and gL (PC), we make sure that the number of edges estimated
  by gL (PC) is roughly the same as the number of edges estimated by JgL (JPC).



A star is a tree where one vertex is connected to all other vertices.

• The nL and JnL estimates are computed differently since it is difficult to control the number of edges
  estimated using both these algorithms. We apply both nL and JnL for multiple different values of $\gamma$
  (the parameter for EBIC) and choose graphs so that the number of edges estimated is closest to the
  number of edges estimated by gL.

• When applying PC and JPC, we choose $\kappa$ as 1, 2, 1, and 3 for Chain, Cycle, Hub, and Neighborhood
  graphs, respectively. When computing H, we choose $\kappa$ as 0, 1, 0, and 2 for Chain, Cycle, Hub, and
  Neighborhood graphs, respectively.

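Tables 2-5 summarize the results for the different types of synthetic graphical models. For an estimate $\hat{G}$
of G*, we evaluate $\hat{G}$ using the weak edge discovery rate (WEDR), false discovery rate (FDR), true positive
rate (TPR), and the edit distance (ED):

$\mathrm{WEDR} = \dfrac{\#\ \text{of weak edges in}\ \hat{G}}{\#\ \text{of weak edges in}\ G^*}$ , (29)

$\mathrm{FDR} = \dfrac{\#\ \text{of edges in}\ \hat{G} \setminus G^*}{\#\ \text{of edges in}\ \hat{G}}$ , (30)

$\mathrm{TPR} = \dfrac{\#\ \text{of edges in}\ \hat{G} \cap G^*}{\#\ \text{of edges in}\ G^*}$ , (31)

$\mathrm{ED} = \{\#\ \text{edges in}\ \hat{G} \setminus G^*\} + \{\#\ \text{edges in}\ G^* \setminus \hat{G}\}$ . (32)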
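The four measures in (29)-(32) can be computed from edge sets as in the sketch below. The function name `evaluate_estimate` and the edge-list inputs are assumptions made for illustration. Two conventions are also assumed here rather than taken from the text: the numerator of WEDR counts true weak edges that appear in the estimate, and FDR is set to zero when the estimate has no edges.

```python
def evaluate_estimate(est_edges, true_edges, p1):
    """Compute WEDR, FDR, TPR, and ED for an estimated graph.

    est_edges, true_edges: iterables of (i, j) pairs with 1-based vertex labels.
    p1: number of leading vertices whose incident true edges count as weak.
    """
    est = {frozenset(e) for e in est_edges}
    true = {frozenset(e) for e in true_edges}
    # Weak edges: true edges with at least one endpoint among the first p1 vertices.
    weak = {e for e in true if min(e) <= p1}
    false_pos = est - true
    false_neg = true - est
    return {
        "WEDR": len(est & weak) / len(weak) if weak else 0.0,
        "FDR": len(false_pos) / len(est) if est else 0.0,
        "TPR": len(est & true) / len(true) if true else 0.0,
        "ED": len(false_pos) + len(false_neg),
    }
```

Edges are stored as unordered pairs, so `(1, 2)` and `(2, 1)` are treated as the same edge.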

Recall that the weak edges are over the first $p_1$ vertices in the graph. Naturally, we want WEDR and TPR
to be large and FDR and ED to be small. Each entry in the tables shows the mean value and standard error
(in brackets) over 50 observations. We now make some remarks regarding the results.

Remark 9.1 (Graphical Lasso). Of all the algorithms, graphical Lasso (gL) performs the worst. On the 
other hand, junction tree based gL significantly improves the performance of gL. Moreover, the performance 
of JgL is comparable, and sometimes even better, when compared to JPC and JnL. This suggests that when 
using gL in practice, it is beneficial to apply a screening algorithm to remove some edges and then use the 
junction tree framework in conjunction with gL. 

Remark 9.2 (PC-Algorithm and Neighborhood Selection). Although using junction trees in conjunction
with the PC-Algorithm (PC) and neighborhood selection (nL) does improve the graph estimation
performance, the difference is not as significant as for gL. This is because both PC and nL make use of the local
Markov property in the graph H. The junction tree framework further improves the performance of these
algorithms by making use of the global Markov property, in addition to the local Markov property.

Remark 9.3 (Chain Graph). Although the chain graph does not satisfy the conditions in (A6), the junction
tree estimates still outperform the non-junction tree estimates. This suggests the advantages of using
junction trees beyond the graphs considered in (A6). We suspect that correlation decay properties, which
have been studied extensively in [1, 2], can be used to weaken the assumption in (A6).

Remark 9.4 (Hub Graph). For the hub graph HB1, the junction tree estimate does not result in a significant
difference in performance, especially for the PC and nL algorithms. This is mainly because this graph
is extremely sparse with multiple components. For the number of observations considered, H removes a
significant number of edges. However, for HB2, the junction tree estimate, in general, performs slightly
better. This is because the parameters associated with the weak edges in HB2 are smaller than those of HB1.

Remark 9.5 (General Conclusion). We see that, in general, the WEDR and TPR are higher, while the
FDR and ED are lower, for junction tree based algorithms. This clearly suggests that using junction trees
results in more accurate graph estimation. Moreover, the higher WEDR suggests that the main differences
between the two algorithms are over the weak edges, i.e., junction tree based algorithms are estimating more
weak edges when compared to a non-junction tree based algorithm.



Recall that both these algorithms use different regularization parameters, so there may exist multiple different estimates
with the same number of edges.
Table 2: Results for Chain graphs: p = 100 and $p_1 = 20$



Model 



Alg WEDR 



FDR 



TPR 



ED 



\G\ 



CHi 

p= 100 



300 



JgL 

IL 
JPC 

PC _ 

JnL" 

nL 



0.305 (0.00473) 
_0A8 (01)04262 _ 

0.312 (0.00441) 
_0^264_(0.00513)_ 

0.306 (0.00477) 

0.271 (0.00446) 



0.048 (0.000972) 0.767 

"0.0466 (0.000954) 0.775 
0.0473 (0.00107) 0.781 



0.0723 (0.00109) 
0.0729 (0.00126) 



0.769 
0.757 



0.00158) 
0.00135) 



0.00135) 
0I)0138)_ 
0.00149) 
0.00147) 



27 (0.176) 
_29 _(0 ^53] _ 
26 (0.162) 
25.6 (0.169) 



28.8 (0.188) 
30 (0.197) 



79.8 

_79.8 

80.5 

81.2 



82.1 
80.9 



CHz 300 JgL 0.0516 (0.00199) 

p = 100 gL 0^00947_(0.00103)_ 

" JPC 0.0484 (0.00189) 

fl _ A033Z iOX)0183_) _ 
" JnL 0.0516 (0.00204) 
nL 0.0389 (0.00201) 



0.0672 (0.00121) 0.727 
0.0619 (0.000974) 0.733 



0.0637 (0.00114) 
0j0515i0j0?l?4) 
"0.077 (0.00113) 
0.086 (0.00144) 



0.735 
0.748 



0.00141) 
0I)0146)_ 
"0.00136) 
0.00114) 



32.2 (0.173) 

31.3 (0.162) 



0.733 
0.723 



0.0014) 
0.00143) 



31.2 (0.169) 
_29.3 (0^144)_ 

"32.5 (orisey 

34.2 (0.216) 



77.3 

77.4 



77.8 

_78.4 

78.7 

78.4 



CHi 

p= 100 



500 



JgL 

JPC 
PC _ 
Jnl" 
nL 



0.596 (0.00551) 
_0^44 (0I)0516)_ _ 

0.612 (0.00507) 
_0^577_(0.0048)_ _ 

0.623 (0.00483) 

0.596 (0.00474) 



0.0206 (0.000597) 0.916 

"0.0215 (0.000705) 0.921 
0.0324 (0.000746) 0.916 



0.00117) 
0.00106) 



0.0588 (0.00092) 
0.0689 (0.00112) 



0.922 
0.918 



0.000976) 
0I)00956)_ 
"0.000925) 
0.000953) 



10.2 (0.133) 

_15.6 (0^132J_ 

9.86 (0.128) 

11.4 (0.124) 



13.5 (0.133) 
14.9 (0.164) 



92.6 
92.7 



93.2 
93.7 



97 
97.6 



CHz 500 JgL 0.0768 (0.00257) 

p = 100 gL 0^0211 (0X)0143_) _ 

" JPC 0.0726 (0.00228) 

f ^ _ A^^^^ A°2.226j _ 
" JnL 0.0758 (0.00243) 

nL 0.0663 (0.00232) 



0.0435 (0.000974) 0.816 
0.0533 (0.000974) 0.808 



0.042 (0.000966) 

0_.048_9_(0_.0011lj 

"0.0702 (0.00109)" 

0.0767 (0.00123) 



0.817 
0.815 
0.818 
0.815 



0.000581) 

0I)00347)_ 

"0.000532) 

0I)00504)_ 

"0.000536) 

0.000555) 



22 (0.107) 
23.5 (0.0824) 



84.5 
84.6 



21.7 (0.0822) 84.5 

_22.5 (0I)918)_ _84.9 

24.2 (0.102) 87.2 

25.1 (0.126) 87.5 



Table 3: Results for Cycle graphs: p = 100 and p_1 = 20

Model  n    Alg  WEDR                FDR                 TPR                 ED                  |G|
CY1    300  JgL  0.314 (0.00356)     0.0355 (0.000617)   0.814 (0.00102)     28.5 (0.142)        111
            gL   0.105 (0.00309)     0.0556 (0.000806)   0.798 (0.000995)    32.9 (0.16)         112
            JPC  0.326 (0.00401)     0.0302 (0.000661)   0.819 (0.00129)     27.2 (0.18)         112
            PC   0.307 (0.00427)     0.0266 (0.000707)   0.826 (0.00125)     26 (0.169)          112
            JnL  0.342 (0.00373)     0.0429 (0.000803)   0.813 (0.00113)     29.5 (0.175)        112
            nL   0.299 (0.00363)     0.0443 (0.000974)   0.793 (0.00131)     32.3 (0.192)        110
CY2    300  JgL  0.0472 (0.00164)    0.0445 (0.000906)   0.762 (0.000961)    36.2 (0.163)        105
            gL   0.0008 (0.000253)   0.0488 (0.000775)   0.759 (0.00109)     37 (0.172)          105
            JPC  0.0432 (0.00176)    0.0424 (0.000877)   0.764 (0.000869)    35.6 (0.174)        105
            PC   0.0272 (0.00147)    0.0355 (0.000758)   0.773 (0.000767)    33.7 (0.137)        106
            JnL  0.042 (0.00209)     0.0575 (0.00108)    0.754 (0.00117)     38.6 (0.21)         106
            nL   0.035 (0.00241)     0.0569 (0.00118)    0.743 (0.00129)     39.9 (0.228)        104
CY1    500  JgL  0.532 (0.0045)      0.0222 (0.000549)   0.907 (0.000933)    15.1 (0.139)        122
            gL   0.278 (0.00526)     0.0707 (0.000723)   0.862 (0.00102)     26.9 (0.178)        122
            JPC  0.61 (0.0042)       0.0157 (0.000575)   0.925 (0.000825)    11.9 (0.15)         124
            PC   0.609 (0.00398)     0.0203 (0.000547)   0.925 (0.000786)    12.5 (0.134)        125
            JnL  0.612 (0.00449)     0.028 (0.000605)    0.924 (0.000825)    13.6 (0.151)        125
            nL   0.584 (0.00449)     0.0406 (0.000726)   0.919 (0.000929)    15.9 (0.171)        126
CY2    500  JgL  0.0864 (0.00271)    0.0389 (0.000766)   0.821 (0.000571)    28.1 (0.116)        113
            gL   0.004 (0.000542)    0.0578 (0.000768)   0.805 (0.000359)    32.3 (0.0883)       113
            JPC  0.0872 (0.00233)    0.0343 (0.000682)   0.825 (0.000467)    27 (0.0988)         113
            PC   0.0744 (0.00234)    0.0399 (0.000689)   0.823 (0.000497)    27.9 (0.0995)       113
            JnL  0.085 (0.00315)     0.0451 (0.00102)    0.824 (0.00061)     28.4 (0.147)        114
            nL   0.069 (0.00309)     0.053 (0.00103)     0.821 (0.000638)    29.8 (0.158)        114
Table 4: Results for Hub graphs: p = 100 and p_1 = 20

Model  n    Alg  WEDR                       FDR                        TPR                        ED                         |G|
HB1    300  JgL  0.204 (0.00398)            0.0389 (0.00107)           0.755 (0.00174)            22.3 (0.151)               63.7
            gL   [illegible] (0.00408)      [illegible] (0.00107)      0.758 ([illegible])        22.1 ([illegible])         63.8
            JPC  0.204 (0.00419)            0.038 (0.00106)            0.753 (0.00175)            22.4 (0.16)                63.4
            PC   [illegible] (0.00422)      0.0377 (0.000953)          0.762 ([illegible])        [illegible]                [illegible]
            JnL  0.245 (0.00464)            0.0887 (0.00131)           0.75 (0.00178)             26.2 ([illegible])         66.7
            nL   0.247 (0.00448)            0.0983 (0.00155)           0.752 (0.00182)            26.8 (0.198)               67.6
HB2    300  JgL  0.0444 (0.00194)           0.0471 (0.00131)           0.71 (0.00134)             26.7 (0.116)               61.2
            gL   [illegible]                [illegible] (0.00127)      0.716 ([illegible])        26 (0.121)                 61.4
            JPC  0.0478 (0.00225)           0.0431 (0.00114)           0.709 (0.00128)            26.5 ([illegible])         60.8
            PC   0.0289 ([illegible])       0.0381 (0.00105)           0.718 ([illegible])        25.5 ([illegible])         61.3
            JnL  0.0544 (0.00248)           0.0833 (0.00142)           0.704 (0.0012)             29.6 (0.146)               63
            nL   0.0467 (0.00226)           0.0958 (0.00138)           0.7 (0.00116)              30.7 (0.138)               63.5
HB1    500  JgL  0.413 (0.00732)            0.0262 (0.000978)          0.87 (0.00159)             12.4 (0.156)               72.4
            gL   [illegible] (0.00704)      0.0352 (0.000983)          0.863 ([illegible])        13.7 ([illegible])         72.5
            JPC  0.438 (0.00677)            0.0269 (0.000842)          0.878 (0.00147)            11.9 (0.148)               73.1
            PC   0.448 (0.00678)            0.0268 (0.000834)          0.882 ([illegible])        [illegible] ([illegible])  73.4
            JnL  0.507 (0.00615)            0.0764 (0.00133)           0.89 (0.00131)             14.9 (0.152)               78.2
            nL   0.52 (0.00706)             0.0907 (0.00139)           0.893 (0.00156)            15.9 (0.191)               79.6
HB2    500  JgL  0.0856 (0.00276)           0.0416 (0.00111)           0.794 (0.000676)           19.8 (0.0855)              68
            gL   [illegible]                0.0474 ([illegible])       0.789 (0.000633)           20.6 (0.0978)              68
            JPC  0.0967 (0.00288)           0.0395 (0.0012)            0.798 (0.000732)           19.3 (0.109)               68.2
            PC   [illegible]                0.0436 ([illegible])       0.797 (0.000687)           19.7 ([illegible])         68.4
            JnL  0.123 (0.00366)            0.0843 (0.00155)           0.804 (0.000816)           22.2 (0.15)                72.1
            nL   0.106 (0.00341)            0.105 (0.00147)            0.801 (0.000743)           24.1 (0.143)               73.4


Table 5: Results for Neighborhood graphs: p = 300 and p_1 = 30

Model  n    Alg  WEDR                       FDR                        TPR                        ED                         |G|
NB1    300  JgL  0.251 (0.00153)            0.0303 (0.000284)          0.813 (0.00049)            126 (0.329)                498
            gL   [illegible]                0.0389 (0.00029)           0.806 (0.000506)           135 (0.345)                498
            JPC  0.259 (0.00159)            0.0313 (0.000231)          0.814 (0.000402)           126 (0.26)                 499
            PC   0.255 (0.00157)            0.036 (0.000276)           0.813 (0.000466)           129 (0.33)                 501
            JnL  0.254 (0.00244)            0.0354 (0.000378)          0.812 (0.000635)           129 (0.461)                500
            nL   0.226 (0.00249)            0.0389 (0.000451)          0.804 (0.000648)           136 (0.458)                497
NB2    300  JgL  0.00457 (0.000335)         0.0429 (0.000364)          0.784 (0.000508)           149 (0.385)                486
            gL   0.000286 (7.64e-05)        0.0361 (0.000283)          0.79 (0.000422)            142 (0.259)                487
            JPC  0.00371 (0.000263)         0.0416 (0.00031)           0.784 (0.000493)           148 (0.376)                486
            PC   0.00286 (0.000244)         0.0479 (0.000281)          0.782 (0.000377)           153 (0.239)                488
            JnL  0.00457 (0.000335)         0.0463 (0.000329)          0.783 (0.000448)           151 (0.356)                488
            nL   0.00314 (0.000274)         0.0504 (0.000403)          0.775 (0.000456)           158 (0.374)                485
NB1    500  JgL  0.449 (0.00183)            0.0179 (0.000212)          0.921 (0.000279)           57.1 (0.199)               557
            gL   [illegible] (0.00217)      0.0349 (0.00025)           0.905 (0.000295)           75.8 (0.242)               557
            JPC  0.489 (0.00181)            0.0148 (0.000171)          0.925 (0.000276)           52.8 (0.189)               558
            PC   0.496 (0.00169)            0.0226 (0.000235)          0.92 (0.000271)            60.2 (0.214)               559
            JnL  [illegible]                0.0274 (0.000328)          0.929 (0.000438)           57.9 (0.348)               567
            nL   0.494 (0.00326)            0.0332 (0.000435)          0.927 (0.000435)           62.3 (0.4)                 570
NB2    500  JgL  0.008 (0.00041)            0.0329 (0.000252)          0.87 (0.000216)            95 (0.206)                 534
            gL   0.000286 (7.64e-05)        [illegible]                0.869 (0.000204)           96 ([illegible])           534
            JPC  0.00886 (0.000448)         0.032 (0.000273)           0.87 (0.000205)            94.2 (0.215)               534
            PC   [illegible] (0.00036)      0.0404 (0.00027)           0.865 (0.000202)           [illegible] (0.207)        536
            JnL  0.00829 (0.000464)         0.038 (0.000283)           0.871 (0.000198)           97.3 (0.22)                538
            nL   0.00486 (0.00032)          0.043 (0.000334)           0.87 (0.000177)            101 (0.234)                540
Figure 9: Graph over a subset of companies in the S&P 100. The positioning of the vertices is chosen so
that the junction tree based graph is aesthetically pleasing. The edges common in (a) and (b) are marked by
bold lines and the remaining edges are marked by dashed lines.

Figure 10: Graph over a subset of companies in the S&P 100. The positioning of the vertices is chosen so
that the graphical Lasso based graph is aesthetically pleasing. The edges common in (a) and (b) are marked
by bold lines and the remaining edges are marked by dashed lines.

9.2 Analysis of Stock Returns Data

We applied our methods to the data set in [12] of n = 216 monthly stock returns of p = 85 companies in the
S&P 100. We computed H using k = 1. We applied JgL using EBIC with γ = 0.5 and applied gL so that
both graphs have the same number of edges. This allows us to objectively compare the gL and JgL graphs.
Figure 9 shows the two estimated graphs with the vertices positioned so that the JgL graph
looks aesthetically pleasing. In Figure 10, the vertices are positioned so that gL looks aesthetically pleasing.
In each graph, we mark the common edges by bold lines and the remaining edges by dashed lines. Some
conclusions that we draw from the estimated graphs are summarized as follows:

• The gL graph in Figure 10(b) seems well structured, with multiple clusters of nodes whose
companies seem to be related to each other. A similar clustering is seen for the JgL graph in
Figure 9(a), with the exception that there are now connections between the clusters. As observed
in [10, 12], it has been hypothesized that the "actual" graph over the companies is dense since there are
several unobserved companies that induce conditional dependencies between the observed companies.
These induced conditional dependencies can be considered to be the weak edges of the "actual" graph.
Thus, our results suggest that the junction tree based algorithm is able to detect such weak edges.

• We now focus on some specific edges and nodes in the graphs. The 11 vertices represented by smaller 
squares and shaded in green are not connected to any other vertex in gL. On the other hand, all these 
11 vertices are connected to at least one other vertex in JgL (see Figure 9). Moreover, several of these
edges are meaningful. For example, CBS and CMCSA are in the television industry, TGT and CVS 
are stores, AEP and WMB are energy companies, CD and RTN are defense companies, and MDT 
and UNH are in the healthcare industry. Finally, the three vertices represented by larger squares and 
shaded in pink, are not connected to any vertex in JgL and are connected to at least one other vertex 
in gL. Only the edges associated with EXC seem to be meaningful. 

9.3 Analysis of Gene Expression Data 

Graphical models have been used extensively for studying gene interactions using gene expression data [35, 51].
The gene expression data we study is the Lymph node status data, which contains n = 148 expression values
from p = 587 genes [26]. Since there is no ground truth available, the main aim in this Section is to highlight
the differences between the estimates JgL (junction tree estimate) and gL (non junction tree estimate). Just
like in the stock returns data, we compute the graph H using k = 1. Both the JgL and gL graphs contain
831 edges. Figure 11 shows the graphs JgL and gL under different placements of the vertices. We clearly see
significant differences between the estimated graphs. This suggests that using the junction tree framework
may lead to new scientific interpretations when studying biological data.

10. Summary and Future Work 

We have outlined a general framework that can be used as a wrapper around any arbitrary undirected 
graphical model selection (UGMS) algorithm for improved graph estimation. Our framework takes as input 
a graph H that contains all (or most of) the edges in G* , decomposes the UGMS problem into multiple 
subproblems using a junction tree representation of H, and then solves the subproblems iteratively to estimate
a graph. Our theoretical results show that certain weak edges, which cannot be estimated using standard 
algorithms, can be estimated when using the junction tree framework. We supported the theory with 
numerical simulations on both synthetic and real world data. All the data and code used in our numerical 
simulations can be found at http://www.ima.umn.edu/~dvats/JunctionTreeUGMS.html.

Our work motivates several interesting future research directions. In our framework, we used a graph H 
to decompose the UGMS problem into multiple subproblems. Alternatively, we can also focus on directly 
finding such decompositions. Another interesting research direction is to use the decompositions to develop 
parallel algorithms for UGMS for estimating extremely large graphs. Finally, motivated by the differences in 
the graphs obtained using gene expression data, another research problem of interest is to study the scientific 
consequences of using the junction tree framework on various computational biology data sets. 
[Figure 11 panels: (c) Junction tree based graphical Lasso; (d) Graphical Lasso]

Figure 11: Graph over genes computed using gene expression data. For (a) and (b), the vertices are chosen
so that the junction tree estimate is aesthetically pleasing. For (c) and (d), the vertices are chosen so that
the graphical Lasso estimate is aesthetically pleasing. Further, in (a) and (c), we only show edges that are
estimated in the junction tree estimate, but not estimated using graphical Lasso. Similarly, for (b) and (d),
we only show edges that are estimated by graphical Lasso, but not by the junction tree estimate.

Figure 12: (a) A graph over eight nodes. (b) The marginal graph over {1, 2, 3, 4, 5}. (c) The marginal graph
over {4, 5, 6, 7, 8}.

Acknowledgement 

We thank Vincent Tan for discussions and comments on an earlier version of the paper. Divyanshu thanks 
the Institute for Mathematics and its Applications (IMA) for financial support in the form of a postdoctoral 
fellowship. 

Appendix A. Marginal Graph 

Definition A.1. The marginal graph $G^{*,m}[A]$ of a graph $G^*$ over the nodes $A$ is defined as a graph with the
following properties:

1. $E(G^*[A]) \subseteq E(G^{*,m}[A])$.

2. For an edge $(i,j) \in E(K_A) \setminus E(G^*[A])$, if all paths from $i$ to $j$ in $G^*$ pass through a subset of the nodes
in $A$, then $(i,j) \notin G^{*,m}[A]$.

3. For an edge $(i,j) \in E(K_A) \setminus E(G^*[A])$, if there exists a path from $i$ to $j$ in $G^*$ such that all nodes in
the path, except $i$ and $j$, are in $V \setminus A$, then $(i,j) \in G^{*,m}[A]$.

The graph $K_A$ is the complete graph over the vertices $A$. The first condition in Definition A.1 says that
the marginal graph contains all edges in the induced subgraph over $A$. The second and third conditions say
which edges not in $G^*[A]$ are in the marginal graph. As an example, consider the graph in Figure 12(a)
and let $A = \{1,2,3,4,5\}$. From the second condition, the edge (3,4) is not in the marginal graph since
all paths from 3 to 4 pass through a subset of the nodes in $A$. From the third condition, the edge (4,5) is
in the marginal graph since there exists a path {4,8,5} that does not go through any nodes in $A \setminus \{4,5\}$.
Similarly, the marginal graph over $A = \{4,5,6,7,8\}$ can be constructed as in Figure 12(c). The importance
of marginal graphs is highlighted in the following proposition.

Proposition A.1. If $P_X > 0$ is Markov on $G^* = (V, E(G^*))$ and not Markov on any subgraph of $G^*$, then
for any subset of vertices $A \subseteq V$, $P_{X_A}$ is Markov on the marginal graph $G^{*,m}[A]$ and not Markov on any
subgraph of $G^{*,m}[A]$.

Proof. Suppose $P_{X_A}$ is Markov on the graph $G_A$ and not Markov on any subgraph of $G_A$. We will show
that $G_A = G^{*,m}[A]$.

• If $(i,j) \in G^*$, then $X_i \not\perp X_j \mid X_S$ for every $S \subseteq V \setminus \{i,j\}$. Thus, $G^*[A] \subseteq G_A$.

• For any edge $(i,j) \in K_A \setminus G^*[A]$, suppose that every path from $i$ to $j$ contains at least one node from
$A \setminus \{i,j\}$. Then, there exists a set of nodes $S \subseteq A \setminus \{i,j\}$ such that $X_i \perp X_j \mid X_S$, and so $(i,j) \notin G_A$.

• For any edge $(i,j) \in K_A \setminus G^*[A]$, suppose that there exists a path from $i$ to $j$ such that all nodes in the
path, except $i$ and $j$, are in $V \setminus A$. This means we cannot find a separator for $i$ and $j$ in the set $A$, so
$(i,j) \in G_A$.

From the construction of $G_A$ and Definition A.1, it is clear that $G_A = G^{*,m}[A]$. □
Using Proposition A.1, it is clear that if the UGMS algorithm $\Psi$ in Assumption 1 is applied to a subset
of vertices $A$, the output will be a consistent estimator of the marginal graph $G^{*,m}[A]$. Note that from
Definition A.1, although the marginal graph contains all edges in $G^*[A]$, it may contain additional edges as
well. Given only the marginal graph $G^{*,m}[A]$, it is not clear how to identify the edges that are in $G^*[A]$. For
example, suppose $G^*$ is a graph over four nodes and let the graph be a single cycle. The marginal graph
over any subset of three nodes is always the complete graph. Given the complete graph over three nodes,
computing the induced subgraph over the three nodes is nontrivial.
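
The four-node cycle example can be checked mechanically. The following is a minimal sketch of Definition A.1, assuming the graph is given as a list of undirected edges with sortable node labels; the function name `marginal_graph` and its helper are hypothetical and not part of the paper's code.

```python
from collections import deque
from itertools import combinations


def marginal_graph(edges, A):
    """Return the edge set of the marginal graph over A (Definition A.1)."""
    A = set(A)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def path_outside_A(i, j):
        # Breadth-first search for a path from i to j whose interior
        # nodes all lie in V \ A (third condition of Definition A.1).
        seen = set()
        queue = deque(n for n in adj.get(i, ()) if n not in A)
        while queue:
            u = queue.popleft()
            if u in seen:
                continue
            seen.add(u)
            for w in adj[u]:
                if w == j:
                    return True
                if w not in A and w not in seen:
                    queue.append(w)
        return False

    result = set()
    for i, j in combinations(sorted(A), 2):
        if j in adj.get(i, ()):
            result.add((i, j))      # first condition: induced edges are kept
        elif path_outside_A(i, j):
            result.add((i, j))      # third condition: path avoids A
    return result


# Single cycle over four nodes, marginalized onto three of them.
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(sorted(marginal_graph(cycle, {1, 2, 3})))
```

For the cycle, the pair (1, 3) is joined through node 4, which lies outside the subset, so the marginal graph over {1, 2, 3} is complete even though the induced subgraph is only the path 1 - 2 - 3.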

Appendix B. Examples of UGMS Algorithms 

We give examples of standard UGMS algorithms and show how they can be used to implement step 3 in
Algorithm 5 when estimating edges in a region of a region graph. For simplicity, we review algorithms for
UGMS when $P_X$ is a Gaussian distribution with mean zero and covariance $\Sigma^*$. Such distributions are referred
to as Gaussian graphical models. It is well known [41] that the inverse covariance matrix $\Theta^* = (\Sigma^*)^{-1}$,
also known as the precision matrix, is such that for all $i \neq j$, $\Theta^*_{ij} \neq 0$ if and only if $(i,j) \in E(G^*)$. In other
words, the graph $G^*$ can be estimated given an estimate of the covariance or inverse covariance matrix of
$X$. We review two standard algorithms for estimating $G^*$: graphical Lasso and neighborhood selection using
Lasso (nLasso).
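
To make the link between the precision matrix and the graph concrete, here is a small sketch using a three-node chain. The precision matrix `theta` is a hypothetical example chosen for illustration, and `invert` and `edges_from_precision` are helper names introduced here, not functions from the paper.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a nonsingular square matrix by Gauss-Jordan elimination (exact)."""
    n = len(matrix)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick any row with a nonzero entry in this column as the pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def edges_from_precision(theta):
    """Edges (i, j), i < j, where the precision matrix entry is nonzero."""
    p = len(theta)
    return {(i, j) for i in range(p) for j in range(i + 1, p) if theta[i][j] != 0}


# Hypothetical precision matrix for the chain 0 - 1 - 2.
theta = [[2, -1, 0],
         [-1, 2, -1],
         [0, -1, 2]]
sigma = invert(theta)  # the corresponding covariance matrix

print(edges_from_precision(theta))  # zero pattern of theta gives the chain
print(sigma[0][2])                  # the covariance between 0 and 2 is nonzero
```

The covariance matrix is dense even though the graph is sparse, which is why the edge set is read from the precision matrix rather than from the covariance.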

B.l Graphical Lasso (gLasso) 

Define the empirical covariance matrix $\hat{S}_A$ over a set of vertices $A \subseteq V$ as follows:

$$\hat{S}_A = \frac{1}{n} \sum_{k=1}^{n} X_A^{(k)} \left(X_A^{(k)}\right)^T . \tag{33}$$

Recall from Algorithm 5 that we apply a UGMS algorithm to $R$ to estimate edges in $H'_R$ defined in (3). The
graphical Lasso (gLasso) estimates $\hat{E}_R$ by solving the following convex optimization problem:

$$\hat{\Theta} = \arg\max_{\Theta \succ 0,\; \Theta_{ij} = 0 \;\forall (i,j) \notin H^{m}[R]} \left\{ \log\det(\Theta) - \mathrm{trace}\left(\hat{S}_R \Theta\right) - \lambda \sum_{(i,j) \in H'_R} |\Theta_{ij}| \right\} \tag{34}$$

$$\hat{E}_R = \left\{ (i,j) \in H'_R : \hat{\Theta}_{ij} \neq 0 \right\} . \tag{35}$$

The graph $H^{m}[R]$ is the marginal graph over $R$ (see Appendix A). When $R = V$, $H = K_V$, and $H'_R = K_V$,
the above equations recover the standard gLasso estimator, which was first proposed in [5]. Equation (34)
can be solved using algorithms in [5, 17, 40, 55]. Theoretical properties of the estimates $\hat{\Theta}$ and $\hat{E}_R$ have
been studied in [38]. Note that the regularization parameter $\lambda$ in (34) controls the sparsity of $\hat{E}_R$. A larger $\lambda$
corresponds to a sparser solution. Further, we only regularize the terms in $\Theta_{ij}$ corresponding to the edges
that need to be estimated, i.e., the edges in $H'_R$. Finally, Equation (34) also accounts for the edges in $H$ by
computing the marginal graph over $R$. In general, $H^{m}[R]$ can be replaced by any graph that is a superset of
$H^{m}[R]$.

B.2 Neighborhood Selection (nLasso) 

Using the local Markov property of undirected graphical models (see Definition 2.1), we know that if $P_X$
is Markov on $G^*$, then $P(X_i \mid X_{V \setminus i}) = P(X_i \mid X_{ne_{G^*}(i)})$. This motivates an algorithm for estimating the
neighborhood of each node and then combining all these estimates to estimate $G^*$. For Gaussian graphical
models, this can be achieved by solving a Lasso problem [45] at each node [33]. Recall that we are interested
in estimating all edges in $H'_R$ by applying a UGMS algorithm to $R$. The neighborhood selection using Lasso
(nLasso) algorithm is given as follows:

$$H'' = K_R \setminus H^{m}[R] \tag{36}$$

$$\hat{\beta}^{k} = \arg\min_{\beta :\; \beta_i = 0 \;\forall (k,i) \in H''} \left\{ \left\| X^n_k - X^n_{R \setminus k}\, \beta \right\|_2^2 + \lambda \sum_{i : (k,i) \in H'_R} |\beta_i| \right\} \tag{37}$$

$$\widehat{ne}^{k} = \left\{ i : \hat{\beta}^{k}_i \neq 0 \right\} \tag{38}$$

$$\hat{E}_R = \bigcup_{k \in R} \left\{ (k,i) : i \in \widehat{ne}^{k} \right\} . \tag{39}$$

Notice that in the above algorithm, if $i$ is estimated to be a neighbor of $j$, then we include the edge $(i,j)$
even if $j$ is not estimated to be a neighbor of $i$. This is called the union rule for combining neighborhood
estimates. In our numerical simulations, we use the intersection rule to combine neighborhood estimates,
i.e., $(i,j)$ is estimated only if $i$ is estimated to be a neighbor of $j$ and $j$ is estimated to be a neighbor of $i$.
Theoretical analysis of nLasso has been carried out in [33][S^. Note that, when estimating the neighbors of
a node $k$, we only penalize the neighbors in $H'_R$. Further, we use prior knowledge about some of the edges by
using the graph $H$ in (37). References [7, 34, 37] extend the neighborhood selection based method to discrete
valued graphical models.
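
The difference between the union and intersection rules can be seen on a toy set of neighborhood estimates. This is a minimal sketch, assuming the per-node neighborhoods have already been computed by some Lasso solver; the function name `combine_neighborhoods` and the input dictionary are hypothetical.

```python
def combine_neighborhoods(neighbors, rule="union"):
    """Combine per-node neighborhood estimates into an undirected edge set.

    neighbors maps each node k to the set of nodes estimated as its neighbors.
    """
    if rule not in ("union", "intersection"):
        raise ValueError("rule must be 'union' or 'intersection'")
    edges = set()
    for k, nbrs in neighbors.items():
        for i in nbrs:
            # Union: one direction suffices. Intersection: both must agree.
            if rule == "union" or k in neighbors.get(i, set()):
                edges.add((min(i, k), max(i, k)))
    return edges


# Hypothetical estimates: node 2 selects node 3, but node 3 does not select node 2.
estimates = {1: {2}, 2: {1, 3}, 3: set()}
print(combine_neighborhoods(estimates, "union"))
print(combine_neighborhoods(estimates, "intersection"))
```

The intersection rule is the more conservative of the two, since an edge survives only when both endpoints select each other.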

Appendix C. Proof of Proposition 5.1

We first prove the following result.

Lemma C.1. For any $(i,j) \in H'_R$, there either exists no non-direct path from $i$ to $j$ in $H$ or all non-direct
paths in $H$ pass through a subset of $\overline{R}$.

Proof. We first show the result for $R \in \mathcal{R}^1$. This means that $R$ is one of the clusters in the junction
tree used to construct the region graph and $ch(R)$ is the set of all separators of cardinality greater than one
connected to the cluster $R$ in the junction tree. Subsequently, $\overline{R} = R$. If $ch(R) = \emptyset$, the claim trivially holds.
Let $ch(R) \neq \emptyset$ and suppose there exists a non-direct path from $i$ to $j$ that passes through a vertex $k \notin R$.
Then, there will exist a separator $S$ in the junction tree such that $S$ separates $\{i,j\}$ and $k$. Thus, all paths
in $H$ from $i$ and $j$ to $k$ pass through $S$. This implies that either there is no non-direct path from $i$ to $j$ in
$H$ or else we have reached a contradiction.

Now, suppose $R \in \mathcal{R}^l$ for $l > 1$. The set $an(R)$ contains all the clusters in the junction tree that contain
$R$. From the running intersection property of junction trees, all these clusters must form a subtree in the
original junction tree. Merge $\overline{R}$ into one cluster and find a new junction tree $\mathcal{J}'$ by keeping the rest of the
clusters the same. It is clear that $\overline{R}$ will be in the first row of the updated region graph. The arguments used
above can be repeated to prove the claim. □

We now prove Proposition 5.1.

Case 1: Let $(i,j) \in H'_R$ and $(i,j) \notin G^*$. If there exists no non-direct path from $i$ to $j$ in $H$, then the edge
$(i,j)$ can be estimated by solving a UGMS problem over $i$ and $j$. By the definition of $\overline{R}$, $i,j \in \overline{R}$. Suppose
there do exist non-direct paths from $i$ to $j$ in $H$. From Lemma C.1, all such paths pass through $\overline{R}$.
Thus, the conditional independence of $X_i$ and $X_j$ can be determined from $X_{\overline{R} \setminus \{i,j\}}$.

Case 2: Let $(i,j) \in H'_R$ and $(i,j) \in G^*$. From Lemma C.1 and using the fact that $E(G^*) \subseteq E(H)$, we
know that all paths from $i$ to $j$ pass through $\overline{R}$. This means that if $X_i \perp X_j \mid X_{\overline{R} \setminus \{i,j\}}$, then $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$.

Appendix D. Analysis of the PC-Algorithm in Algorithm 4

In this Section, we present an analysis of Algorithm 4 using results from [1] and [20]. The analysis presented
here is for the non-junction tree based algorithm. Throughout this Section, assume

$$\hat{G} = \mathrm{PC}(\eta, X^n, K_V, K_V),$$

where $K_V$ is the complete graph over the vertices $V$. Further, let the threshold for the conditional independence
test in (6) be $\lambda_n$. We are interested in finding conditions under which $\hat{G} = G^*$ with high probability.
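
The thresholded test can be illustrated with a small sketch. It assumes a Gaussian model, computes the partial correlation of $X_i$ and $X_j$ given $X_S$ from a covariance matrix by inverting the sub-block over $\{i, j\} \cup S$, and declares conditional independence when the absolute value falls below a threshold. The helper names, the example covariance, and the threshold value 0.1 are illustrative assumptions, not taken from the paper.

```python
import math


def invert(matrix):
    """Invert a nonsingular square matrix (Gauss-Jordan, partial pivoting)."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlation(cov, i, j, S):
    """Partial correlation of X_i and X_j given X_S under a Gaussian model."""
    idx = [i, j] + list(S)
    sub = [[cov[a][b] for b in idx] for a in idx]
    omega = invert(sub)  # precision matrix of the sub-vector (X_i, X_j, X_S)
    return -omega[0][1] / math.sqrt(omega[0][0] * omega[1][1])


def looks_independent(cov, i, j, S, threshold):
    """Thresholded test: small partial correlation is read as independence."""
    return abs(partial_correlation(cov, i, j, S)) < threshold


# Covariance of a hypothetical three-node chain 0 - 1 - 2.
cov = [[0.75, 0.5, 0.25],
       [0.5, 1.0, 0.5],
       [0.25, 0.5, 0.75]]
print(looks_independent(cov, 0, 2, [], 0.1))   # marginally dependent
print(looks_independent(cov, 0, 2, [1], 0.1))  # independent given node 1
```

With an empirical covariance in place of `cov`, the same computation gives the empirical partial correlation that is compared against $\lambda_n$.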
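Theorem D.1. Under Assumptions (A1)-(A5), there exists a conditional independence test such that if

$$n = \Omega\left(\rho_{\min}^{-2}\, \eta \log(p)\right), \quad \text{or} \tag{40}$$

$$\rho_{\min} = \Omega\left(\sqrt{\eta \log(p)/n}\right), \tag{41}$$

then $P(\hat{G} \neq G^*) \to 0$ as $n \to \infty$.

Define the set $B_\eta$ as follows:

$$B_\eta = \left\{ (i,j,S) : i,j \in V,\; i \neq j,\; S \subseteq V \setminus \{i,j\},\; |S| \le \eta \right\} . \tag{42}$$

The following concentration inequality follows from [1].

Lemma D.1. Under Assumption (A4), there exist constants $c_1$ and $c_2$ such that for $\epsilon < M$,

$$\sup_{(i,j,S) \in B_\eta} P\left( \left| |\hat{\rho}_{ij|S}| - |\rho_{ij|S}| \right| > \epsilon \right) \le c_1 \exp\left( -c_2 (n - \eta) \epsilon^2 \right), \tag{43}$$

where $n$ is the number of vector valued measurements made of $X_i$, $X_j$, and $X_S$.

Proof. See [1]. □

Let $P_e = P(\hat{G} \neq G^*)$, where the probability measure $P$ is with respect to $P_X$. Recall that we threshold
the empirical conditional partial correlation $\hat{\rho}_{ij|S}$ to test for conditional independence, i.e., $|\hat{\rho}_{ij|S}| < \lambda_n \Rightarrow X_i \perp X_j \mid X_S$.
An error may occur if there exist two distinct vertices $i$ and $j$ such that either $\rho_{ij|S} = 0$ and
$|\hat{\rho}_{ij|S}| > \lambda_n$, or $|\rho_{ij|S}| > 0$ and $|\hat{\rho}_{ij|S}| < \lambda_n$. Thus, we have

$$P_e \le P(\mathcal{E}_1) + P(\mathcal{E}_2), \tag{44}$$

$$P(\mathcal{E}_1) = P\left( \bigcup_{(i,j) \notin G^*} \left\{ \exists\, S \text{ s.t. } |\hat{\rho}_{ij|S}| > \lambda_n \right\} \right) \tag{45}$$

$$P(\mathcal{E}_2) = P\left( \bigcup_{(i,j) \in G^*} \left\{ \exists\, S \text{ s.t. } |\hat{\rho}_{ij|S}| < \lambda_n \right\} \right) . \tag{46}$$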

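We will find conditions under which $P(\mathcal{E}_1) \to 0$ and $P(\mathcal{E}_2) \to 0$, which will imply that $P_e \to 0$. The term
$P(\mathcal{E}_1)$, the probability of including an edge in $\hat{G}$ that does not belong to the true graph, can be upper bounded
as follows:

$$P(\mathcal{E}_1) \le P\left( \bigcup_{(i,j) \notin G^*} \left\{ \exists\, S \text{ s.t. } |\hat{\rho}_{ij|S}| > \lambda_n \right\} \right) \le P\left( \bigcup_{(i,j) \notin G^*,\; S \subset V \setminus \{i,j\}} \left\{ |\hat{\rho}_{ij|S}| > \lambda_n \right\} \right) \tag{47}$$

$$\le p^{\eta+2} \sup_{(i,j,S) \in B_\eta} P\left( |\hat{\rho}_{ij|S}| > \lambda_n \right) \tag{48}$$

$$\le c_1 p^{\eta+2} \exp\left( -c_2 (n - \eta) \lambda_n^2 \right) = c_1 \exp\left( (\eta + 2) \log(p) - c_2 (n - \eta) \lambda_n^2 \right) . \tag{49}$$

The term $p^{\eta+2}$ comes from the fact that there are at most $p^2$ edges and the algorithm searches
over at most $p^{\eta}$ separators for each edge. Choosing $\lambda_n$ such that

$$\lim_{n,p \to \infty} \frac{(n - \eta) \lambda_n^2}{(\eta + 2) \log(p)} = \infty \tag{50}$$

ensures that $P(\mathcal{E}_1) \to 0$ as $n, p \to \infty$. Further, choose $\lambda_n$ such that for $c_3 < 1$,

$$\lambda_n < c_3\, \rho_{\min} . \tag{51}$$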
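
To get a feel for the bound in (49), the sketch below evaluates its right-hand side and solves it for the smallest $n$ that pushes the bound below a target level $\delta$, namely $n \ge \eta + \left[(\eta+2)\log(p) + \log(c_1/\delta)\right] / (c_2 \lambda_n^2)$. The constants $c_1$ and $c_2$ are not specified in the text, so the default values of 1 used here are placeholders for illustration only, as are the function names.

```python
import math


def e1_bound(n, p, eta, lam, c1=1.0, c2=1.0):
    """Right-hand side of (49); c1 and c2 are placeholder constants."""
    return c1 * math.exp((eta + 2) * math.log(p) - c2 * (n - eta) * lam ** 2)


def min_samples(p, eta, lam, delta, c1=1.0, c2=1.0):
    """Smallest integer n for which the bound in (49) is at most delta."""
    needed = eta + ((eta + 2) * math.log(p) + math.log(c1 / delta)) / (c2 * lam ** 2)
    return math.ceil(needed)


# Hypothetical setting: p = 100 nodes, separators of size at most 1,
# threshold 0.5, and target error level 0.05.
n_required = min_samples(p=100, eta=1, lam=0.5, delta=0.05)
print(n_required, e1_bound(n_required, p=100, eta=1, lam=0.5))
```

The required $n$ grows with $\log(p)$ and with $\eta$, and shrinks as the threshold $\lambda_n$ grows, which is the scaling that conditions (50) and (51) trade off.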






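The term $P(\mathcal{E}_2)$, the probability of not including an edge in $\hat{G}$ that does belong to the true graph, can be upper
bounded as follows:

$$P(\mathcal{E}_2) \le P\left( \bigcup_{(i,j) \in G^*} \left\{ \exists\, S \text{ s.t. } |\hat{\rho}_{ij|S}| < \lambda_n \right\} \right) \le P\left( \bigcup_{(i,j) \in G^*,\; S \subset V \setminus \{i,j\}} \left\{ |\rho_{ij|S}| - |\hat{\rho}_{ij|S}| > |\rho_{ij|S}| - \lambda_n \right\} \right) \tag{52}$$

$$\le p^{\eta+2} \sup_{(i,j,S) \in B_\eta} P\left( |\rho_{ij|S}| - |\hat{\rho}_{ij|S}| > |\rho_{ij|S}| - \lambda_n \right) \tag{53}$$

$$\le p^{\eta+2} \sup_{(i,j,S) \in B_\eta} P\left( \left| |\rho_{ij|S}| - |\hat{\rho}_{ij|S}| \right| > \rho_{\min} - \lambda_n \right) \tag{54}$$

$$\le c_1 p^{\eta+2} \exp\left( -c_2 (n - \eta) (\rho_{\min} - \lambda_n)^2 \right) \le c_1 \exp\left( (\eta + 2) \log(p) - c_4 (n - \eta) \rho_{\min}^2 \right) . \tag{55}$$

To get (55), we use (51) so that $(\rho_{\min} - \lambda_n) > (1 - c_3) \rho_{\min}$, which gives the last inequality with $c_4 = c_2 (1 - c_3)^2$.
For some constant $c_5 > 0$, suppose that for all $n > n'$ and $p > p'$,

$$c_4 (n - \eta) \rho_{\min}^2 > (\eta + 2 + c_5) \log(p) . \tag{56}$$

Given (56), $P(\mathcal{E}_2) \to 0$ as $n, p \to \infty$. In asymptotic notation, we can write (56) as

$$n = \Omega\left( \rho_{\min}^{-2}\, \eta \log(p) \right), \tag{57}$$

which proves the Theorem. The conditional independence test is such that $\lambda_n$ is chosen to satisfy (50) and
(51). In asymptotic notation, we can show that $\lambda_n = O(\rho_{\min})$ and $\lambda_n^2 = \Omega(\eta \log(p)/n)$ satisfy (50) and
(51).
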


Appendix E. Proof of Theorem 8.1

To prove the theorem, it is sufficient to establish that

$$\rho_0 = \Omega\left( \sqrt{\eta_T \log(p)/n} \right) \tag{58}$$

$$\rho_1 = \Omega\left( \sqrt{\eta \log(p_1)/n} \right) \tag{59}$$

$$\rho_2 = \Omega\left( \sqrt{\eta \log(p_2)/n} \right) \tag{60}$$

$$\rho_T = \Omega\left( \sqrt{\eta \log(p_T)/n} \right) . \tag{61}$$

Let $H$ be the graph estimated in Step 1. An error occurs if for an edge $(i,j) \in G^*$ there exists a subset
of vertices $S$ such that $|S| \le \eta_T$ and $|\hat{\rho}_{ij|S}| < \lambda_n^0$. Using the proof of Theorem D.1 (see the analysis of $P(\mathcal{E}_2)$),
it is easy to see that $n = \Omega(\rho_0^{-2}\, \eta_T \log(p))$ is sufficient for $P(E(G^*) \not\subseteq E(H)) \to 0$ as $n \to \infty$. Further, the
threshold is chosen such that $\lambda_n^0 = O(\rho_0)$ and $(\lambda_n^0)^2 = \Omega(\eta_T \log(p)/n)$. This proves (58).

In Step 2, we estimate the graphs $\hat{G}_1$ and $\hat{G}_2$ by applying the PC-Algorithm to the vertices $V_1 \cup T$ and
$V_2 \cup T$, respectively. For $\hat{G}_1$, given that all edges that have a separator of size $\eta_T$ have been removed, we can
again use the analysis in the proof of Theorem D.1 to show that for $\lambda_n^1 = O(\rho_1)$ and $(\lambda_n^1)^2 = \Omega(\eta \log(p_1)/n)$,
$n = \Omega(\rho_1^{-2}\, \eta \log(p_1))$ is sufficient for $P(\hat{G}_1 \neq G^*[V_1 \cup T] \setminus K_T \mid G^* \subseteq H) \to 0$ as $n \to \infty$. This proves (59).
Using similar analysis, we can prove (60) and (61).

The probability of error can be written as

$$P_e \le P(G^* \not\subseteq H) + \sum_{k=1}^{2} P\left( \hat{G}_k \neq G^*[V_k \cup T] \setminus K_T \mid G^* \subseteq H \right) + P\left( \hat{G}_T \neq G^*[T] \mid G^* \subseteq H,\; \hat{G}_1 = G^*[V_1 \cup T] \setminus K_T,\; \hat{G}_2 = G^*[V_2 \cup T] \setminus K_T \right) . \tag{62}$$

Given (58)-(61), each term on the right goes to 0 as $n \to \infty$, so $P_e \to 0$ as $n \to \infty$.